INTRODUCTION

The charge of AFOSR Grant F49620-79-C-0066 was the study of a new class
of memory intensive digital arithmetic units based on modular algebra. The
new class of arithmetic units, developed under this grant, operates at very
high speeds, admits VLSI and bit-slice realizations, and can be integrated
into digital signal processing systems.

Numerous authors have demonstrated the potential of residue arithmetic
for realizing high speed signal processing and computational systems.

These methods are memory intensive in that table lookup operations are
used to perform modular arithmetic. However, there is a possible flaw in
this contemporary residue arithmetic work: its dependence on high speed
memory. Admittedly, memory is becoming available with higher densities
and access speeds. However, such memories present a non-trivial power demand
on the system and are very expensive. For example, Intel's HMOS 1K x 4 memory,
having access times of 55, 70, and 80 ns, costs on the order of $82, $76, and
$62 per copy, respectively. INMOS 16K (4K x 4) static RAM is available at 35 ns,
and Fairchild markets a 1K ECL (high power dissipation) RAM at higher per unit
costs. Our research has determined that by using K x 1 HMOS memories, an
equivalent 12 x 12 RNS multiplier could be configured which has a pipelined
throughput of 35 ns, but it would require 9.9 watts of active power and 1.65
watts standby. Furthermore, the cost per modulus would exceed $1,000.
Therefore, high performance residue based signal processing systems may carry
a high price tag as well. It may therefore be wise to rethink our dependence
on memory intensive arithmetic.

Footnote  [i]:  This  condition  will  be  strongly  influenced  by  the  results  of 
DOD's  VHSIC  Program. 
It would seem advantageous to architect future residue arithmetic
based systems on those technologies which will provide the highest performance
in terms of the following parameters:

1. speed
2. cost
3. power dissipation
4. packaging
5. availability
6. system compatibility

The last two parameters are unfortunately often neglected in
exploratory research efforts. It would reflect poor engineering practice
to develop a technology dependent theory which is incompatible with its
electronic environment. The technology which seems to yield the greatest
promise is VLSI. High performance arithmetic units are already
available in VLSI. For example, the TRW VLSI carry-save 2's complement
multiplier line breaks down as follows:

TABLE 1

UNIT         SIZE       PINS   SPEED (ns)   POWER (watts)
MPY-8HJ-1    8 x 8      40     45           1.5
MPY-12HJ     12 x 12    64     80           2.7
MPY-16HJ     16 x 16    64     100          4.0
MPY-24HJ     24 x 24    64     200          5.0

By comparison, the 12 x 12 35 ns RNS multiplier is more than twice as fast
as the VLSI unit but consumes more than 3.5 times the power. However, the
above VLSI multiplier units are designed to work in 2's complement and
therefore do not support residue arithmetic directly. Since these basic
fixed point 2's complement VLSI multipliers offer outstanding performance
for the price, it is desirable to integrate them into a residue number
system (RNS) structure.

Footnote [i]: Small Scale Integration [SSI] = 50 gates, MSI = 50-100 gates,
VLSI = 4000 gates/chip

RESIDUE  ARITHMETIC 

Before a meaningful dialog on residue arithmetic units and systems
can be established, the fundamental properties of this numbering system
should be reviewed.

The residue number system (RNS) is a mature mathematical subject. A serious
study of the RNS was offered by Gauss in the 19th century. In 1968 Szabo
and Tanaka published the book "Residue Arithmetic and Its Applications to
Computer Technology". Due to the technological limitations of the period,
the book did not receive wide-spread recognition and was soon out of print.
However, due to the recent availability of high density, high performance
Read Only Memory (ROM) and Random Access Memory (RAM), the RNS is being
reinvestigated for application in digital filter design, implementation
of fast transforms, convolution, and optical computation.

Let $P = \{p_1, p_2, \ldots, p_L\}$ be a set of relatively prime integers, and
let $x$ be any integer in $[0, M-1]$ (called the dynamic range), where
$M = \prod_{i=1}^{L} p_i$. Then by the Euclidean algorithm, there exist
integers $k_i$, $x_i$, with $0 \le x_i < p_i$, such that

$$x = k_i p_i + x_i, \qquad i = 1, 2, \ldots, L \qquad (1)$$

The quantity $x_i$ is called the ith residue of $x$, and is usually denoted as
$|x|_{p_i}$ or $x \bmod p_i$.
It is easy to show that $x$ and $M+x$ have the same residue representation.
Only if $x \in [0, M-1]$ can $x$ be uniquely determined by the L-tuple
$(x_1, x_2, \ldots, x_L)$. In this case denote $x = (x_1, x_2, \ldots, x_L)$.
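
The encoding of eq. (1) can be illustrated with a short Python sketch. The function name `to_rns` and the test value 12345 are illustrative choices, not part of the original report; the moduli set is the one used in a later example.

```python
from math import gcd, prod

def to_rns(x, moduli):
    """Return the residue L-tuple (x mod p_1, ..., x mod p_L)."""
    return tuple(x % p for p in moduli)

P = (32, 31, 29, 27)      # pairwise relatively prime moduli
assert all(gcd(a, b) == 1 for i, a in enumerate(P) for b in P[i + 1:])
M = prod(P)               # dynamic range M = 32 * 31 * 29 * 27

x = 12345
print(to_rns(x, P))                       # residue digits of x
print(to_rns(x + M, P) == to_rns(x, P))   # x and M + x share one representation
```

The second line printed is `True`, which is why a tuple identifies an integer only within a range of M consecutive values.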

A signed encoding scheme can also be used. In this case, the
dynamic range is $[-(M-1)/2, (M-1)/2]$ with a negative number $-|x|$ coded as
$M - |x|$. There is a trivial isomorphism which maps $[0, M-1]$ onto
$[-(M-1)/2, (M-1)/2]$. This second coding scheme has the advantage that sign
detection is not required during arithmetic operations and the sign of the
result will take care of itself, provided that no overflow (out of dynamic
range) has occurred.

The following are some identities which will be used later. The proofs
are straightforward and will not be presented.

$$|x \pm y|_p = \big|\, |x|_p \pm |y|_p \,\big|_p \qquad (2)$$

$$|xy|_p = \big|\, |x|_p \, |y|_p \,\big|_p \qquad (3)$$

$$|-x|_p = |p - x|_p \qquad (4)$$

Let $Z_p$ be the set of integers $x$ such that $0 \le x < p$ (ie: the residue
class). It is well known that $Z_p$ is a commutative ring under addition and
multiplication modulo $p$. For any integer $x \in Z_p$, the inverse of $x$ is
the integer $y \in Z_p$ such that $|xy|_p = 1$. It is also known that if $x$ is
relatively prime to $p$, then $x$ has a unique inverse, denoted $x^{-1}[p]$.
For example, in $Z_6$, $5^{-1}[6] = 5$.
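
As a small illustration of the inverse $x^{-1}[p]$, the following Python sketch uses the built-in three-argument `pow` (Python 3.8 or later). The helper name `mod_inverse` is an assumption made for this example.

```python
from math import gcd

def mod_inverse(x, p):
    """Return x^{-1}[p], or None when gcd(x, p) != 1 and no inverse exists."""
    if gcd(x, p) != 1:
        return None
    return pow(x, -1, p)      # modular inverse via built-in pow

print(mod_inverse(5, 6))      # 5, since 5 * 5 = 25 = 4 * 6 + 1
print(mod_inverse(2, 6))      # None: 2 and 6 share the factor 2
```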

Arithmetic operations in RNS are defined in a straightforward manner.
Let $x$, $y$ be integers in $[0, M-1]$ with $x = (x_1, x_2, \ldots, x_L)$ and
$y = (y_1, y_2, \ldots, y_L)$. Then $z = x \circ y = (z_1, z_2, \ldots, z_L)$
where $z_i = (x_i \circ y_i) \bmod p_i$ for $i = 1, 2, \ldots, L$, and
"$\circ$" denotes the operation $\times$, $+$ or $-$.

It is clear that the sub-operation within each modulus is independent
of the others. That is, no carry information is necessary between moduli.
The arithmetic is also exact and therefore free of round-off error. The
result $z$ is exact if $0 \le z \le M-1$; however, if $x \circ y \ge M$
(overflow) then the answer will be incorrect. Hence it is critical to know
beforehand that the result will not exceed the finite RNS dynamic range.
Division in RNS is known to be difficult. Therefore RNS is considered to be
best applied to systems where division is not the dominant operation.
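
The digit-by-digit arithmetic and the overflow hazard described above can be sketched in Python as follows. The names `to_rns` and `rns_op` and the operand values are illustrative assumptions, not taken from the report.

```python
import operator
from math import prod

P = (32, 31, 29, 27)
M = prod(P)

def to_rns(x, moduli=P):
    return tuple(x % p for p in moduli)

def rns_op(op, xs, ys, moduli=P):
    """Apply op digit by digit; no carries pass between moduli."""
    return tuple(op(a, b) % p for a, b, p in zip(xs, ys, moduli))

x, y = 700, 900
z = rns_op(operator.mul, to_rns(x), to_rns(y))
print(z == to_rns(x * y))                # exact: 630000 lies inside [0, M-1]

big = rns_op(operator.mul, to_rns(1000), to_rns(1000))
print(big == to_rns(1000 * 1000 - M))    # overflow: result wraps modulo M
```

Both comparisons print `True`. The second shows that a product larger than M is indistinguishable from its wrapped value, so the overflow cannot be detected from the residues alone.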

Another RNS induced scheme is called the mixed-radix numbering system
(MRNS). Given the moduli set $P = \{p_1, p_2, \ldots, p_L\}$, any integer
$x \in [0, M-1]$ can be expressed uniquely as

$$x = x_1 + x_2 p_1 + x_3 p_1 p_2 + \cdots + x_L p_1 p_2 \cdots p_{L-1} \qquad (5)$$

Let $q_1 = 1$ and $q_i = \prod_{j=1}^{i-1} p_j$; eq. (5) can then be written as

$$x = x_1 q_1 + x_2 q_2 + x_3 q_3 + \cdots + x_L q_L \qquad (6)$$

or equivalently, $x$ can be represented uniquely by the L-tuple
$x = \langle x_1, x_2, \ldots, x_L \rangle$. The $x_i$'s written inside angle
brackets are called the mixed radix digits, with $0 \le x_i \le p_i - 1$. The
mixed-radix number system is a weighted number system. Therefore carries
between digits are necessary in arithmetic operations. A property of a
weighted number system is that magnitude comparison is trivial.

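It is often necessary to compute the M.R. digits
$\langle x_1, x_2, \ldots, x_L \rangle$ from a given set of residue digits
$(x_1, \ldots, x_L)$. In the equations below, the $x_i$ on the right-hand
sides denote mixed radix digits. Here

$$x = x_1 + x_2 p_1 + \cdots + x_L \prod_{i=1}^{L-1} p_i \qquad (7)$$

Hence,

$$|x|_{p_1} = \Big| x_1 + x_2 p_1 + \cdots + x_L \prod_{i=1}^{L-1} p_i \Big|_{p_1} = |x_1|_{p_1} = x_1 \qquad (8)$$

After subtracting $x_1$ from both sides of eq. (7), one obtains

$$x - x_1 = x_2 p_1 + \cdots + x_L \prod_{i=1}^{L-1} p_i \qquad (9)$$

$$|x - x_1|_{p_2} = \Big| x_2 p_1 + \cdots + x_L \prod_{i=1}^{L-1} p_i \Big|_{p_2} = |x_2 p_1|_{p_2}$$

Upon multiplying both sides by $p_1^{-1}[p_2]$, which exists by the relatively
prime property of $p_1$ and $p_2$, one obtains

$$\big| (x - x_1)\, p_1^{-1}[p_2] \big|_{p_2} = \big| x_2\, p_1\, p_1^{-1}[p_2] \big|_{p_2} = |x_2|_{p_2} = x_2 \qquad (10)$$

This process can be carried out successively until all $x_i$'s are
obtained. Actually, the iterative process can be realized in parallel form
due to the independence of residues. An algorithm found in Szabo and
Tanaka can be used.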
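
The successive subtract-and-multiply procedure of eqs. (7) through (10) can be sketched in Python as shown below. This is a minimal illustration under the assumption of pairwise relatively prime moduli; the function names are hypothetical and the sketch is not claimed to be the Szabo and Tanaka algorithm itself.

```python
def mixed_radix_digits(residues, moduli):
    """Convert residue digits to mixed-radix digits (eqs. 8-10, repeated)."""
    r = list(residues)
    digits = []
    for i, p_i in enumerate(moduli):
        d = r[i] % p_i
        digits.append(d)
        # Subtract the digit, then multiply by p_i^{-1} in every remaining
        # modulus; the inner loop is independent per modulus (parallel form).
        for j in range(i + 1, len(moduli)):
            p_j = moduli[j]
            r[j] = ((r[j] - d) * pow(p_i, -1, p_j)) % p_j
    return digits

def from_mixed_radix(digits, moduli):
    """Evaluate x = d_1 + d_2 p_1 + d_3 p_1 p_2 + ... as in eq. (5)."""
    x, weight = 0, 1
    for d, p in zip(digits, moduli):
        x += d * weight
        weight *= p
    return x

P = (32, 31, 29, 27)
x = 12345
digits = mixed_radix_digits([x % p for p in P], P)
print(digits, from_mixed_radix(digits, P) == x)
```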

It was noted that division is difficult in RNS. However, in the case
that the divisor is a fixed constant $c$ (where $c$ is relatively prime to
$p_i$, $i = 1, \ldots, L$), there is known to exist some simplification of the
scaling task. The scaling operation is formally defined as follows: Given
$P = \{p_1, p_2, \ldots, p_L\}$ and $x = (x_1, x_2, \ldots, x_L)$, what is the
residue representation of $[x/c]$? (where $[\,\cdot\,]$ denotes the rounding to
the closest integer operation.) From the Euclidean Algorithm, namely

$$x = \left[\frac{x}{c}\right] c + |x|_c$$

it follows that

$$\left[\frac{x}{c}\right] = \frac{x - |x|_c}{c} \qquad (11)$$

which is of integer value. The residue representation of this integer is
given in terms of a "scaling kernel" satisfying

$$\left| \left[\frac{x}{c}\right] \right|_{p_i} = \big| (x - |x|_c)\, c^{-1}[p_i] \big|_{p_i} = \big| (x_i - |x|_c)\, c^{-1}[p_i] \big|_{p_i} \qquad (12)$$

Thus, if $|x|_c$ is known, then the residue representation of $[x/c]$ can be
obtained using one subtraction and one multiplication. Since $c$ and $p_i$ are
relatively prime, $c^{-1}[p_i]$ exists w.r.t. $p_i$. Usually $|x|_c$ will not
be given and has to be found by a base extension algorithm.
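
A minimal Python sketch of the scaling kernel of eq. (12) follows. It assumes $|x|_c$ is already known (here it is simply computed from the integer, standing in for the base extension step) and takes $|x|_c$ in $[0, c)$, so that the scaled value equals the integer quotient. The function name and test values are illustrative.

```python
def scale_by_constant(residues, moduli, c, x_mod_c):
    """Residues of (x - |x|_c) / c: one subtraction, one multiplication each."""
    return tuple(((x_i - x_mod_c) * pow(c, -1, p_i)) % p_i
                 for x_i, p_i in zip(residues, moduli))

P = (32, 31, 29)
c = 27                       # relatively prime to every p_i
x = 20000
residues = tuple(x % p for p in P)

# x % c stands in for the base-extension result |x|_c
scaled = scale_by_constant(residues, P, c, x % c)
print(scaled == tuple((x // c) % p for p in P))
```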

The integer value of a residue representation can also be obtained
through the use of the Chinese Remainder Theorem (CRT).

Given $P = \{p_1, p_2, \ldots, p_L\}$, where $p_i$, $p_j$ are relatively prime
for $i \ne j$, the CRT states:

Theorem (Chinese Remainder Theorem)

$$|x|_M = \Big| \sum_{i=1}^{L} m_i \, \big| x_i \, m_i^{-1}[p_i] \big|_{p_i} \Big|_M \qquad (13)$$

where

$$m_i = \frac{M}{p_i} = \prod_{j \ne i} p_j \quad \text{and} \quad \big| m_i \, m_i^{-1}[p_i] \big|_{p_i} = 1$$

Proof: Since a residue number represents an integer uniquely in the dynamic
range $[0, M-1]$, it is enough to show that the right-hand side of eq. (13)
has residues $(x_1, x_2, \ldots, x_L)$. Since $p_j$ divides $m_i$ for every
$i \ne j$,

$$\Big| \sum_{i=1}^{L} m_i \, \big| x_i \, m_i^{-1}[p_i] \big|_{p_i} \Big|_{p_j} = \Big| m_j \, \big| x_j \, m_j^{-1}[p_j] \big|_{p_j} \Big|_{p_j} = |x_j|_{p_j} = x_j$$

The claim follows.

*Notice that the left-hand side of eq. (13) is in the form of $|x|_M$. That is,
the resulting integer will be unique if $0 \le x < M$.
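
Eq. (13) can be evaluated directly, as in the following Python sketch. The function name `crt_decode` is an illustrative assumption; the moduli must be pairwise relatively prime for the inverses to exist.

```python
from math import prod

def crt_decode(residues, moduli):
    """Evaluate eq. (13): x = | sum_i m_i * |x_i * m_i^{-1}[p_i]|_{p_i} |_M."""
    M = prod(moduli)
    total = 0
    for x_i, p_i in zip(residues, moduli):
        m_i = M // p_i
        total += m_i * ((x_i * pow(m_i, -1, p_i)) % p_i)
    return total % M

P = (32, 31, 29, 27)
x = 12345
print(crt_decode([x % p for p in P], P) == x)
```

Note that the final reduction is modulo M, a large modulus, which is one reason the mixed-radix route is sometimes preferred in hardware.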

There is yet another method which may be used to decode a residue tuple.
This method has been independently reported by Jenkin and Julian.
Starting from the residue representation $(x_1, x_2, \ldots, x_L)$, the
mixed-radix representation $\langle x_1, x_2, \ldots, x_L \rangle$ is obtained
through a M.R. conversion. Then eq. (5) will be used to reconstruct $x$. This
method is called M.R. reconstruction.


RNS  CAPABILITIES 

Interest in the RNS is due to its ability to perform high-speed arithmetic.
Speed is achieved through the use of a high degree of parallelism and an
absence of carry information requirements. Recall that the arithmetic
composition of two integers, say $x = (x_1, \ldots, x_L)$ and
$y = (y_1, \ldots, y_L)$, given by $x \circ y$ (where $\circ$ denotes addition,
subtraction, or multiplication) satisfies
$x \circ y = (x_1 \circ y_1, \ldots, x_L \circ y_L)$. It can be seen that each
residue digit, namely $x_i \circ y_i$, can be computed independently of all
others (ie: no carry information requirements). In practice, the mapping of
$x_i$ and $y_i$ into $x_i \circ y_i$ is accomplished using table lookups where
the table resides in randomly accessed read-only memory. Typical high-speed
memory modules, which are currently available, are:


TABLE 2

Device      Type   Technology   Configuration   Access-Speed
10149       ROM    ECL          256x4           20 ns
SM54S       ROM    TTL          1024x4          35 ns
2147H-1     RAM    HMOS         4096x1          30 ns
2125H-1     RAM    HMOS         1024x1          20 ns
12167       RAM    HMOS         16384x1         45 ns
IMS 1400    RAM    MOS          16384x4         30 ns

The product of two residues modulo $p_i$, $p_i \le 2^n$, can be precomputed and
stored in a $2^m \times n$-bit memory unit where $m = 2n$. Using a large
existing high-speed memory (4Kx1 at 30 ns), residues having up to six bit
integer values can be used (ex: $P = \{64, 63, \ldots\}$). Thus, fixed-point
multipliers having a dynamic range of $[-M/2, M/2)$ can be architected which
have execution rates in the low nanoseconds.
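
The direct product lookup described above can be sketched in Python as follows. The table is modelled as a flat list addressed by the n-bit $x$ field concatenated with the n-bit $y$ field; the function names and the modulus 63 are illustrative assumptions.

```python
def build_product_table(p, n):
    """Flat table of 2^(2n) words; address = n-bit x followed by n-bit y."""
    assert p <= (1 << n)
    table = [0] * (1 << (2 * n))
    for x in range(p):
        for y in range(p):
            table[(x << n) | y] = (x * y) % p
    return table

def lookup_multiply(table, x, y, n):
    """Modular product obtained by a single table access."""
    return table[(x << n) | y]

n = 6
table = build_product_table(63, n)
print(len(table))                           # 4096 words: a 4K address space
print(lookup_multiply(table, 50, 40, n))    # (50 * 40) mod 63 = 47
```

The table length grows as $2^{2n}$, which is the memory cost that the later sections set out to reduce.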

The disadvantages of the residue number system are manifold. Since
the RNS possesses no most significant digit, decimal to residue conversion,
division, magnitude comparison, and arithmetic shift operations are
cumbersome and should be avoided. Register overflow, due to the finite dynamic
range, imposes a severe constraint on RNS operations. Unlike weighted
numbers (decimal, binary, etc.) where rounding or truncating least significant
digits can control overflow, such is not the case in the RNS. Since there
is an absence of least significant digits, the more general and inefficient
operation known as scaling must be used. Since scaling is a form of division,
its use should be discouraged. To gain insight into this problem, consider
the inner product of two 31-dimensional real vectors $x$ and $y$ whose entries
are encoded as residue digits with respect to $P = \{32, 31, 29, 27\}$. Without
scaling, the worst-case value of the entries of $x$ and $y$ would be limited to
$V^{.5}$ where $V = (M/2)/31 = 25056$. Therefore, to insure that no worst case
overflow can occur, a 7.3-bit (ie: $V^{.5} = 158 = 2^{7.3}$) dynamic range
limitation must be imposed on $x$ and $y$. With scaling, larger input ranges
can be used at the expense of statistical accuracy in the output space
(analogous to roundoff errors).

Due to the dynamic range limitation of RNS systems, one is generally
forced to accept one of the following two overflow prevention strategies.

1. Increase the dynamic range to a sufficiently large value
by adding more moduli to P, or

2. Make scaling a more efficient operation.

The first option represents a brute force attack on the problem. Such
an approach will increase the cost and complexity metrics of a filter. In
addition, the moduli set P must be tailored to each unique filter. The other
approach appears to be the most popular at this time. Szabo and Tanaka,
and others, have concentrated on scaling efficiency through the choice
of the three-tuple moduli set $P = \{2^n - 1, 2^n, 2^n + 1\}$. This moduli set
has the ability to efficiently scale a residue number by any one of the chosen
moduli. However, there is an intrinsic limitation plaguing this method:
its dynamic range. Using a large high-speed memory unit, say 4Kx1,
the input addressing space is limited to $2^{12}$. This means that a modulus
$p_i$ is technically limited to $p_i \le 2^6$ (ie: $x_i \cdot y_i < 2^{12}$).
Therefore, the dynamic range of any modular operation is given by
$M = (2^n - 1)(2^n)(2^n + 1) \approx 2^{3n} = 2^{18}$.
In many applications, an 18-bit resolution is insufficient.

LARGE  MODULI  AU 

It is desirable to keep the previously discussed three moduli structure
for purposes of potential scaling needs. However, in order to overcome the
existing disadvantage of this system, that of dynamic range, a new
architecture is called for. Since it is unrealistic to assume substantially
larger density high-speed memories will continue to become available, it is
incumbent that a more memory efficient residue arithmetic unit be designed. An
efficient algorithm, which is ideally suited for this application, is known
as the quarter-square multiplier.

$$\langle xy \rangle_p = \langle \phi(s^+) - \phi(s^-) \rangle_p \qquad (14)$$

where $\phi(s) = \langle s^2 \rangle_p$ with $s^+ = (x+y)/2$ and $s^- = (x-y)/2$.

The quarter-square multiplier has been studied by J.M. Pollard (1976)
in a Galois field. Questions of hardware implementation were not considered
and, due to the Galois field limitation, only prime moduli could be considered.
H. Nussbaumer (1976) studied the quarter-square multiplier over real fields
for use in ROM intensive digital filters. Soderstrand and Fields (1977) made
brief reference to this multiplier for residue arithmetic but offered no
satisfactory hardware realization. Our research has produced a practical
residue arithmetic quarter-square modular multiplier in commercially available
hardware.
A problem that would seem to plague the quarter-square multiplier is
the need to realize the division by two of the sums and differences. In
general, the existence of an $N^{-1}$, such that $\langle N^{-1} N \rangle_p = 1$,
can only be guaranteed if $N$ is relatively prime to $p$. Since one of the
chosen moduli is $p = 2^n$, the multiplicative inverse of 2 cannot be guaranteed
to exist. Therefore, the quarter-square multiplier cannot be directly
interpreted as the equation
$\langle \langle 1/4 \rangle_{p_i} \langle (x+y)^2 - (x-y)^2 \rangle_{p_i} \rangle_{p_i}$.
The potential problem of dividing the sums or differences, found in equation
14, by 2 will be explicitly and efficiently treated for the first time later in
this paper. For a $2^m$ word memory unit, the direct product architecture (ie:
$xy$) would limit the maximal moduli to be bounded by $2^n$, $n = m/2$. In fact,
this claim can be extended to the case where $p = 2^n + 1$ through use of the
following modification. Observe that if $x_i = 0$, then it automatically
follows that $\langle x_i y_i \rangle_{p_i} = 0$. Therefore, if
$x_i = 0 \rightarrow 0 \wedge 00\ldots0$ (which is a detectable condition in
that the (n+1)st bit and the remaining n-bit block are zero) the output
register would be automatically cleared. Therefore, the lookup table need
not be accessed for this case. Instead, the all zero n-bit portion of the
table address, allocated to $x_i$, can be used to represent $x_i = 2^n$, where
$x_i = 2^n \rightarrow 1 \wedge 00\ldots0$. Here, the table would be programmed
to map $y_i$ into $\langle 2^n y_i \rangle_p$ using only a $2^m$ word memory.
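
The zero-detect modification for $p = 2^n + 1$ can be sketched in Python as follows. In this illustration the all-zero n-bit address field is decoded as $2^n$, and a zero operand bypasses the table. The function names are assumptions for the example, and the same treatment is applied to both operands.

```python
def build_table_2n_plus_1(n):
    """2^(2n)-word table for p = 2^n + 1; an all-zero field stands for 2^n."""
    p = (1 << n) + 1
    size = 1 << n
    decode = lambda code: code if code else size
    return [(decode(a) * decode(b)) % p
            for a in range(size) for b in range(size)]

def multiply_2n_plus_1(x, y, n, table):
    if x == 0 or y == 0:        # zero detect: clear the output, skip the lookup
        return 0
    mask = (1 << n) - 1         # keep the low n bits; 2^n maps to all zeros
    return table[((x & mask) << n) | (y & mask)]

n = 3
table = build_table_2n_plus_1(n)
print(len(table))                           # 64 words for p = 9
print(multiply_2n_plus_1(8, 8, n, table))   # (8 * 8) mod 9 = 1
```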

The memory requirements associated with the quarter-square multiplier
are substantially less than those of direct mechanizations. First, it should
be apparent that the integers $s^+$ and $s^-$, found in equation 14, are bounded
from above by $2^{n+1}$. Therefore, only a (n+1)-bit table addressing space is
required to realize $\phi(s^\pm)$ versus the 2n-bit space needed for direct
architectures.

It would appear, however, that there is an exception to this rule, since one
of the moduli chosen is $p = 2^n + 1$. Here the maximal value of $s^+$ (or
$s^-$) is $2^{n+1}$, which would technically require a (n+2)-bit address.
However, by using the protocol found in Figure 2, which is an adaptation of the
network found in Figure 1, the table size can be reduced to $2^{n+1}$ words for
all moduli. Here, the overflow bit serves to differentiate $s^\pm = 0$ from
$2^{n+1}$.

The quarter-square architecture is abstracted in Figure 3. It uses a
$2^{n+1}$ word high-speed memory for modular arithmetic lookup operations.
Using, for example, the previously referenced 4K, 30 ns device, moduli having
an 11-bit dynamic range (vs. 6-bit in the direct form) can be mechanized. This
would yield a three-moduli dynamic range on the order of
$2^{33} \approx 8.6 \cdot 10^9$. That is, without an increase in memory size
(and therefore access time), the dynamic range of the quarter-square form is
$2^{33}/2^{18} = 2^{15}$ times larger than that obtainable through direct means!
This large increase in dynamic range makes the RNS a viable alternative to
traditional filter design methods. Both improved precision and throughput
(through the reduction or absence of traditional scaling operations) can be
achieved.
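
The dynamic range figures quoted above can be checked with a few lines of Python. The function name is illustrative; the values n = 6 and n = 11 correspond to a 4K-word memory addressed with 2n bits (direct form) and n+1 bits (quarter-square form), respectively.

```python
from math import log2

def three_moduli_range(n):
    """M = (2^n - 1)(2^n)(2^n + 1) = 2^(3n) - 2^n."""
    return ((1 << n) - 1) * (1 << n) * ((1 << n) + 1)

direct = three_moduli_range(6)      # 2n-bit address, 4K words -> n = 6
quarter = three_moduli_range(11)    # (n+1)-bit address, 4K words -> n = 11

print(direct, quarter)
print(round(log2(quarter / direct), 2))   # gain in bits of dynamic range
```

The gain comes out at essentially 15 bits, in agreement with the ratio $2^{33}/2^{18}$ stated in the text.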

Several versions of the multiplier algorithm can be considered. They are
summarized in Figure 4. The first, called the sequential form, would have an
estimated throughput rate of 240 ns based on a 60 ns lookahead adder and memory
having an access time of 30 ns with a cycle time of 60 ns. The second
architecture, called the parallel form, would run at a 180 ns rate. The parallel
architecture is preferred because of its higher speed and simpler control. A
60 ns pipelined execution rate can be purchased at a small hardware cost.

[Figure 1. Modulo $2^n+1$: RAM or ROM lookup of Z = XY mod P, P = $2^n+1$,
where X = $x_L$ if $x_L \ne 0$ and X = $2^n$ if $x_L = 0$ (likewise for Y).]

[Figure 2. Memory Compression For $s^\pm$]

[Figure 4. Architectures (sequential form: four 60 ns steps, 240 ns total)]

Upon closer investigation of the table lookup data base, a potential
nuisance can be found. It can be exemplified by observing that if $s^-$ is
odd and $p = 2^n$, then $\phi(s^-) = \langle s^2/4 \rangle_{2^n} = x.25$, that
is, a value with a fractional part of .25 (here $s$ denotes the undivided
difference $x - y$). Therefore, it would seem that two additional fractional
bits need to be added to the table's output wordlength. However, this is not
the case, as suggested by the following theorem:

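Theorem: Let $\|v\|$ denote the integer value of $v$. Then
$z = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$.
That is, only the integer value of $\phi$ need be used and the fractional bits
of $\phi(s^\pm)$ can be ignored.

Proof: For $x$, $y$ and $k$ integers, one may define two rational numbers,
namely $(x+y)/2 = v + k/2$ and $(x-y)/2 = q + k/2$, where $k = 0$ or $1$ (the
same $k$ serves both, since $x+y$ and $x-y$ have the same parity). Then

$$z = \langle \langle (x+y)^2/4 \rangle_p - \langle (x-y)^2/4 \rangle_p \rangle_p
= \langle \langle v^2 + kv + k^2/4 \rangle_p - \langle q^2 + kq + k^2/4 \rangle_p \rangle_p$$

$$= \langle \langle v^2 + kv \rangle_p + k^2/4 - \langle q^2 + kq \rangle_p - k^2/4 \rangle_p
= \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$$

since $k^2/4 < 1$ is exactly the fractional part of each term and cancels in
the difference.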

As a result, the parallel architecture is equivalent to that shown in
Figure 5. Furthermore, by deriving the above theorem over a rational field,
and showing that the results pertain to the integers, several classical
problems are overcome:

1.  The  quarter-squared  multiplier  is  not  restricted  to  the 
Galois  fields  suggested  by  Pollard. 

2.  The  question  of  the  existence  of  the  multiplicative 
inverse  of  4  is  now  moot. 
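
The theorem can be exercised with a short Python sketch in which the table stores only the integer part of $s^2/4$, reduced modulo $p$, and is addressed by the undivided sum or difference. The function names are assumptions, and for simplicity the table holds one extra entry for the address $2^{n+1}$ rather than using the overflow-bit compression of Figure 2.

```python
def build_phi_table(n, p):
    """Integer part of s^2 / 4, reduced mod p, for addresses s = 0 .. 2^(n+1)."""
    return [((s * s) // 4) % p for s in range((1 << (n + 1)) + 1)]

def quarter_square_multiply(x, y, p, table):
    """<xy>_p from two lookups and one modular subtraction."""
    return (table[x + y] - table[abs(x - y)]) % p

n = 3
for p in (7, 8, 9):                      # 2^n - 1, 2^n, 2^n + 1
    table = build_phi_table(n, p)
    ok = all(quarter_square_multiply(x, y, p, table) == (x * y) % p
             for x in range(p) for y in range(p))
    print(p, ok)
```

The check succeeds for $p = 2^n$ as well, where 2 and 4 have no multiplicative inverse, which is the point of items 1 and 2 above.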

MODULO  p  ADDER 

The quarter-square multiplier requires that a modulo p adder be used to
combine the two component parts of the solution (namely $\phi(s^+)$ and
$\phi(s^-)$). Modulo p adders pose an interesting design problem. Unless a fast
modulo p adder can be fabricated, the overhead associated with addition will
offset any gain in throughput achieved through table lookups. For the moduli
chosen, $2^n - 1$, $2^n$, and $2^n + 1$, only the modulo $2^n$ adder can be
realized directly (n-bit adder with ignored overflow). It would, however, be
desirable to use an n-bit adder to realize the modulo $2^n - 1$ and $2^n + 1$
adders as well. For the purpose of clarity, let $s$ be defined to be the sum of
$\phi(s^+)$ and $\phi(s^-)$. The following observation then follows:


TABLE 3

                                   Modulo 2^N adder     Modulo p_i adder      Example: N=3
Case  Dynamic integer range of s   <s>_2^N   OVF bit    p_i      <s>_p_i      s     <s>_p_i
 1    s = 0                        0         0          2^N-1    0            0     0
 2    1 <= s <= 2^N-2              s         0          2^N-1    s            4     4
 3    s = 2^N-1                    s         0          2^N-1    0            7     0
 4    s = 2^N                      0         1          2^N-1    s-2^N+1      8     1
 5    s >= 2^N+1                   s-2^N     1          2^N-1    s-2^N+1      10    3
 6    s = 0                        0         0          2^N      0            0     0
 7    1 <= s <= 2^N-1              s         0          2^N      s            4     4
 8    s = 2^N                      0         1          2^N      0            8     0
 9    2^N+1 <= s <= 2^(N+1)-2      s-2^N     1          2^N      s-2^N        10    2
10    s = 0                        0         0          2^N+1    0            0     0
11    1 <= s <= 2^N-1              s         0          2^N+1    s            4     4
12    s = 2^N                      0         1          2^N+1    s            8     8
13    2^N+1 <= s <= 2^(N+1)-1      s-2^N     1          2^N+1    s-2^N-1      10    1
14    s = 2^(N+1) (special case)   0         0          2^N+1    s-2^N-1      16    7

Using n-bit AND gates to sense the zero condition of $\langle s \rangle_{2^N}$,
the overflow bit OVF, and the sign bits of $\phi(s^+)$ and $\phi(s^-)$,
combinational logic can be defined which will map $\langle s \rangle_{2^N}$ into
$\langle s \rangle_{p_i}$. It can be noted from the data found in Table 3
that the mapping requirements are:

1. for $p = 2^N - 1$, map $s$ to $s$ or $s - 2^N + 1 = \langle s \rangle_{2^N} + 1$

2. for $p = 2^N$, map $s$ to $s$ or $s - 2^N = \langle s \rangle_{2^N}$

3. for $p = 2^N + 1$, map $s$ to $s$ or $s - 2^N - 1 = \langle s \rangle_{2^N} - 1$

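
The three mappings can be modelled in Python as a correction applied to the output of an n-bit adder and its overflow bit, following the cases of Table 3. This is a behavioural sketch only; the function name and the string labels for the three moduli are assumptions made for the example.

```python
def add_mod(a, b, n, kind):
    """Add residues for p = 2^n - 1 ('minus'), 2^n ('pow2') or 2^n + 1 ('plus')."""
    size = 1 << n
    s = a + b
    r = s & (size - 1)          # <s>_{2^n}: n-bit adder output
    ovf = (s >> n) & 1          # overflow bit of the n-bit adder
    if kind == "pow2":
        return r                # mapping 2: overflow is simply ignored
    if kind == "minus":
        if ovf:
            return r + 1        # mapping 1: s - 2^n + 1
        return 0 if r == size - 1 else r
    if kind == "plus":
        if s == 2 * size:       # special case: both operands equal 2^n
            return size - 1
        if ovf:                 # s = 2^n stays as is, otherwise s - 2^n - 1
            return size if r == 0 else r - 1
        return r
    raise ValueError(kind)

print(add_mod(5, 5, 3, "minus"), add_mod(5, 5, 3, "pow2"), add_mod(5, 5, 3, "plus"))
```

For N = 3 and s = 10 the three results are 3, 2 and 1, matching the example column of cases 5, 9 and 13.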

Mapping two is trivially satisfied with an n-bit adder. The other two
mappings require that $s$ remains unchanged or is decremented or incremented
by unity. There are several ways to approach this problem. Bioul, Davio,
and Quisquater have presented an unorthodox architecture for a modulo
$(2^n - 1)$ adder using two-input gates. Modulo $(2^n + 1)$ adders can also be
realized through the use of end-around-carries. However, compared to modulo
$2^n$ addition, this approach would almost double the addition delay. This
extended delay problem can be overcome through added complexity (ie: time
multiplexing two end-around-carry adders). Mappings one and three can be
efficiently realized in the manner suggested by the example found in Appendix
A. The functional operation of adding one (mapping 1) or subtracting one
(mapping 3) from the output of an n-bit adder is performed by a PLA. The PLA
will provide an overlay mask which accomplishes the required task. The
derivation and utility of the mask can be understood in the context of the
following example.

Example: Suppose $s$ is an 11-bit word having a decimal value of $s_{10} = 92$,
or $s_2 = 00001011100$. If $s_{10} - 1 = 91$, or $(s_{10} - 1)_2 = 00001011011$,
is desired, one notes that only the 3 LSBs of $s_2$ need be altered. In
general, for n = 12, only the following 13 distinct binary masks are required
to form $(s_{10} - 1)_2$.
MSB   Pattern   LSB

   XXXXXXXXXXXX
   XXXXXXXXXXX0
   XXXXXXXXXX01
       ...
   X01111111111
   011111111111

Notation:
X        = leave corresponding bit location of $s_2$ unchanged
1 (or 0) = change corresponding bit location of $s_2$ to 1 (or 0)

Table II. MASK
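
The overlay mask for the decrement operation can be generated as in the following Python sketch. It is a software model of what the PLA and mask switches would do, written under the assumption that the mask depends only on the position of the lowest set bit of $s$; the function names are illustrative.

```python
def decrement_mask(s, n):
    """n-character overlay mask (MSB first) that turns s into s - 1."""
    if s <= 0:
        raise ValueError("s must be positive")
    k = (s & -s).bit_length()      # number of low-order bits that change
    return "X" * (n - k) + "0" + "1" * (k - 1)

def apply_mask(s, mask):
    """Overlay the mask on the binary word: X keeps a bit, 0/1 forces it."""
    bits = format(s, "0{}b".format(len(mask)))
    return int("".join(b if m == "X" else m for b, m in zip(bits, mask)), 2)

mask = decrement_mask(92, 11)
print(mask)                        # XXXXXXXX011: only the 3 LSBs are altered
print(apply_mask(92, mask))        # 91
```

For n = 12 this produces 12 distinct altering masks, which together with the all-X (no change) mask gives the 13 patterns of Table II.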


Suppose the modulus $p = 2^n + 1$, n = 12, is to be implemented. By using two
commercially available 16x9 PLA's in parallel, the 12-bit output of an n-bit
adder (shown as $\langle s \rangle_{2^N}$ in Table 3) and the four previously
specified control bits can be converted to a 13-bit mask. The mask would
transform the output of a high-speed n-bit adder to $s$ or $s - 2^n - 1$,
depending on the state of the 4 control bits. Based on a 25-ns 12-bit Schottky
lookahead adder, a 20-ns 16x9 PLA, and 10-ns FET mask switches (see the
notation comments of Table II), a 65-ns modulo p adder, for $p = 2^n - 1$,
$2^n$, and $2^n + 1$, can be realized. The presence of a 65-ns modulo p adder
will now allow a 140-ns large moduli residue multiplier based on 35-ns 4Kx1
HMOS memory units. (See Figure 5.) For a moduli set
$\{2^{12} - 1, 2^{12}, 2^{12} + 1\}$, a fixed point multiplier having an output
dynamic range of $2^{36} - 2^{12}$ can thus be fabricated having a word rate of
7.143 M multiplications per second. This compares favorably with new 16x16 VLSI
multipliers. Using a pipelined architecture, which requires the insertion of
the storage registers found in Figure 5, a very impressive throughput figure of
28.5 M multiplications per second can be achieved. It is important, and
fortunate, to realize that the Intel HMOS memory unit used in this analysis has
a cycle time equal to the access time. If, as is often found in practice, a
memory unit has a cycle time approximately twice the access time, then the
pipeline delay would increase from 35 ns to 70 ns.

VLSI RNS MULTIPLIERS

As previously noted, a 16-bit three-moduli 35 ns pipelined multiplier is
more than three times as fast as the VLSI unit but consumes more than four
times the power and is significantly more complex. However, the above VLSI
multiplier units are designed to work in 2's complement and therefore do
not support residue arithmetic directly. In this paper, the algebraic
elegance and speed of the RNS are combined with the technological advantages
of VLSI to achieve a high-performance modular multiplier.

Since the RNS is an exact numbering system, the nesting of modular
arithmetic operations can result in register overflow. Register overflow
occurs when the result of an arithmetic operation exceeds the admissible
dynamic range M. For a set of relatively prime moduli
$P = \{p_1, \ldots, p_L\}$, $M = \prod p_i$, $i = 1, 2, \ldots, L$. Overflow
prevention in the RNS is accomplished through the use of a relatively
inefficient operation referred to as scaling. This can be mechanized using the
mixed-radix conversion algorithm or the Chinese Remainder Theorem. To insure
that there will be some degree of efficiency in the scaling operation, the
moduli set must be carefully chosen. A particularly useful moduli set is
$P = \{2^n - 1, 2^n, 2^n + 1\}$. Based on this choice of moduli, a VLSI based
residue multiplier can be realized in commercially available hardware.

VLSI-RNS MULTIPLIER STRUCTURE

This structure will be presented as three special cases.

Modulus $p = 2^n$


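For the purpose of discussion, consider $p=2^n$ to be a moduli and x,
$x\in Z_p$, to be the composite number

$\langle x\rangle_p = x = 2^m x_{HI} + x_{LO}$;  for all $x_{HI}$ or $x_{LO}$, $0 \le x_{HI}, x_{LO} \le 2^m-1$    (16)

where m=n/2. Here $x_{HI}$ and $x_{LO}$ are m-bit positive integers and $Z_p$ is the
residue class of integers modulo p. For a y having the same format, it
follows that $z=\langle xy\rangle_p$ is given by

$z = \langle xy\rangle_p = \langle 2^n a + b + 2^m(c+d)\rangle_p$    (17)

where:

$a = x_{HI}\,y_{HI}$;  $0 \le a \le (2^m-1)^2 < 2^n-1$
$b = x_{LO}\,y_{LO}$;  $0 \le b \le (2^m-1)^2 < 2^n-1$
$c = x_{HI}\,y_{LO}$;  $0 \le c \le (2^m-1)^2 < 2^n-1$
$d = x_{LO}\,y_{HI}$;  $0 \le d \le (2^m-1)^2 < 2^n-1$
$V = c+d$;  $0 \le V \le 2(2^m-1)^2$    (18)

Under the hypothesis that $p=2^n$, and noting $2^m=2^n/2^m$, z computes to be

$z = \langle\, \langle 2^n a\rangle_{2^n} + \langle b\rangle_{2^n} + \langle 2^n(c+d)/2^m\rangle_{2^n} \,\rangle_{2^n}$    (19)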


The last term in equation 19 may seem to pose a potential hardware realization
difficulty. However, this need not be the case in light of the following
interpretation. Suppose that the (n+1)-bit binary representation of the
positive integer V=(c+d) has the form xx...x (x=0 or 1). Then $V/2^m$ can be
formed by simply defining the binary point to precede the mth LSB. That is,
$V/2^m = IV + .XV$ where IV is the integer part of $V/2^m$ and .XV the fractional
part. Thus $\langle 2^n V/2^m\rangle_{2^n} = \langle 2^n(IV+.XV)\rangle_{2^n} = \langle 2^n\,IV\rangle_{2^n} + \langle 2^n(.XV)\rangle_{2^n}$,
where the first term is congruent to zero modulo $2^n$. Computing
$\langle 2^n(.XV)\rangle_{2^n}$ could promise to be an inefficient operation if conventional
digital methods are used. However, this need not be the case since XV is
known to be an m-bit word where m is n/2. For example, if a 24-bit moduli is
desired (which represents a substantial improvement over the 5-bit moduli
typically found in the literature), then m=12, and a 4K x 1 high speed (35ns)
memory can be used to implement the mapping $\langle 2^n(.XV)\rangle_{2^n}$ as a table lookup
operation. The partial product terms could then be combined by a modulo $2^n$
adder to form $z=\langle xy\rangle_p$.


Moduli p=2^n-1

Equation 16 can be rewritten in terms of the following set of relationships

i.  $2^n=(2^n-1)+1$
ii. $2^m=2^n/2^m=(2^n-1)/2^m+1/2^m$

$z=\langle ((2^n-1)a+a)+b+((2^n-1)(V/2^m)+V/2^m)\rangle_p$

From the previous analysis, one notes that $\langle a\rangle_p=a$, $\langle b\rangle_p=b$ and $V/2^m=IV+.XV$, so that

$\langle (2^n-1)V/2^m\rangle_p = \langle (2^n-1)(IV+.XV)\rangle_p = \langle (2^n-1)(.XV)\rangle_p$.

Using lookup operations and a $2^m$ word memory unit the modular mapping can
again be implemented directly. The term $V/2^m$ is, as previously stated,
simply a reassignment of the binary point of V. Again the partial product
terms would be recombined using a modulo p adder.

Moduli p=2^n+1

This case requires special attention since it is not completely
analogous to the previous cases considered. In particular, not all the
residues in the residue class can be encoded into an n-bit word and
represented as $x=2^m x_{HI}+x_{LO}$. In other words, the admissible residue $x=2^n$
does not conform to the accepted data format. However, $x=2^n$ is an easily
detected case since it is represented by x=1000...0 (ie: test the MSB for 1
and AND with the n LSBs being 0's). If x is detected to have a value of $2^n$,
then only the following events are admissible

TABLE 4

x      y       z=<xy>_p; p=2^n+1                                        example, n=6
2^n    <2^n    <(2^n+1)y - y>_(2^n+1) = <-y>_(2^n+1)                    <64(y=5)>_65 = 65-5 = 60
2^n    2^n     <2^(2n)>_(2^n+1) = <(2^n+1)^2 - 2(2^n+1) + 1>_(2^n+1) = 1    <4096>_65 = 1

These two possible events can be separately programmed without reducing
throughput. That is, upon receipt of x (or y) = $2^n$, the output will be
immediately set to $\langle -y\rangle_p$ (or to 1 when both x and y equal $2^n$).

An architecture capable of realizing the proposed large moduli multiplier
in VLSI is suggested in Figure 5. This system is composed of four
commercially available VLSI multipliers, one custom VLSI quad modulo p adder,
and memory units for table lookup use. More will be said on the structure
of the modulo p adder in the next section. For values of n=24 or 16 bits,
and based on commercial multiplier specifications, a three moduli multiplier
system can be built having a dynamic range on the order of 72 to 48 bits.
Furthermore, based on these parameters, a multiplication can be partitioned into
four 100 ns operations. This translates into a real-time throughput of 2.5 M
multiplications per second for a serial realization or 10 M multiplications per second
when pipelined (a most impressive 72 to 48-bit multiplication rate). The
multiplier suggested in Figure 5 performs the following operations.

TABLE 5

Operation                 p=2^n   p=2^n-1   p=2^n+1   Level   Remarks
S1: <2^n x_HI y_HI>_p     0       a         -a        1       VLSI multiplier
S2: <x_LO y_LO>_p         b       b         b         1       VLSI multiplier
T1: x_HI y_LO             c       c         c         1       VLSI multiplier
T2: x_LO y_HI             d       d         d         1       VLSI multiplier
U:  <S1+S2>_p             b       a+b       b-a       2       mod p adder
V:  c+d                   V       V         V         2       adder-shift register
V':                       0       +V/2^m    -V/2^m    2       adder-shift register
.XV:                      .XV     .XV       .XV       2       adder-shift register
W1: <U+V'>_p              W1      W1        W1        3       mod p adder
W2: <p(.XV)>_p            W2      W2        W2        3       table lookup
z:  <W1+W2>_p             z       z         z         4       mod p adder

Example: n=6, m=n/2=3, p=2^6=64

Let x = 18 = 2(8)+2 -> 2:2 (HI:LO)
    y = 31 = 3(8)+7 -> 3:7 (HI:LO)

<xy>_64 = <558>_64 = 46
<xy>_63 = <558>_63 = 54
<xy>_65 = <558>_65 = 38

a=6, b=14, c=14, d=6

TABLE 6

Operation         p=64              p=63                   p=65
S1: a             <64(6)>_p = 0     <63(6)+6>_p = 6        <65(6)-6>_p = -6
S2: b             <14>_p = 14       <14>_p = 14            <14>_p = 14
T1: c             14                14                     14
T2: d             6                 6                      6
U:  u             14                20                     8
V:  v             20 -> 0010100     20 -> 0010100          20 -> 0010100
V':               SET=0             20/8 -> +0010.100      -20/8 -> -0010.100
.XV               .100              .100                   .100
W1: w1            14 -> 0001110     22.5 -> 0010110.100    5.5 -> 0000101.100
W2: w2 (lookup)   32 -> 0100000     31.5 -> 0011111.100    32.5 -> 0100000.100
z                 14+32=46          22.5+31.5=54           5.5+32.5=38
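
The operations of Tables 5 and 6 can be checked with a short software model. The following Python sketch is illustrative only and is not the hardware of Figure 5: the function name `split_multiply` and the `kind` parameter (0, -1 or +1, selecting p = 2^n, 2^n-1 or 2^n+1) are assumptions introduced here, n is assumed even, and the fractional quantities V/2^m and p(.XV) are carried as integers scaled by 2^m so that the arithmetic stays exact.

```python
def split_multiply(x, y, n, kind):
    """Return <x*y>_p for p = 2**n + kind, where kind is 0, -1 or +1.

    Assumes n is even and 0 <= x, y < p.
    """
    m = n // 2
    p = (1 << n) + kind
    if kind == 1:
        # The residue 2**n does not fit the n-bit HI:LO format (Table 4).
        if x == 1 << n:
            return (-y) % p
        if y == 1 << n:
            return (-x) % p
    mask = (1 << m) - 1
    x_hi, x_lo = x >> m, x & mask
    y_hi, y_lo = y >> m, y & mask
    a = x_hi * y_hi                 # S1 operand
    b = x_lo * y_lo                 # S2
    v = x_hi * y_lo + x_lo * y_hi   # V = c + d
    sign = -kind                    # 2**n is congruent to 0, +1, -1 (mod p)
    u = (sign * a + b) % p          # U
    xv = v & mask                   # .XV = xv / 2**m
    # All terms below are scaled by 2**m so the fractions stay integral.
    w1 = (u << m) + sign * v        # W1 = U + V'
    w2 = p * xv                     # W2 = p * .XV (the table lookup)
    return ((w1 + w2) >> m) % p     # z = <W1 + W2>_p


# The worked example of Table 6: x = 18, y = 31, n = 6.
print([split_multiply(18, 31, 6, kind) for kind in (0, -1, 1)])
```

For the example operands the three calls reproduce the last row of Table 6 (46, 54 and 38).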

RNS  TO  DECIMAL  CONVERSION 

One of the major disadvantages of the residue numbering system (RNS)
is its inability to efficiently perform magnitude comparison. Magnitude
comparisons are critical to general purpose RNS operations since they are
generic to the management of dynamic range (register) overflow and conditional
branching. Unlike weighted numbering systems, where overflow can be efficiently
handled by comparing data fields starting at the most significant digit, RNS
procedures are complex and time consuming. Various versions of these
RNS-to-decimal routines have been published which make use of modular table
look-up operations and distributed (bit-slice) arithmetic.[13,14] However,
the methods reported in the literature require a significant hardware investment
and consume a disproportionate amount of run-time compared to other
computational operations (viz: addition, subtraction, and multiplication).
In this program, a new three moduli RNS-to-decimal converter has been developed
which is significantly more efficient than existing techniques.

[Figure: VLSI/RNS MULTIPLIER]

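With respect to the moduli set $P=\{p_1,p_2,p_3\}$, there exist, for $0\le x<M$,
three unique mixed radix conversion (MRC) digits $X_1, X_2, X_3$ such that

$x = X_1 + X_2\,p_2 + X_3\,p_2 p_3$    (23)

where $x \leftrightarrow (x_1,x_2,x_3)$ in the RNS, with

$X_1 = x_2$
$X_2 = \langle\, p_2^{-1}[p_3] * \langle x_3-x_2\rangle_{p_3} \,\rangle_{p_3}$    (24)
$X_3 = \langle\, p_3^{-1}[p_1] * [\, p_2^{-1}[p_1] * \langle x_1-x_2\rangle_{p_1} - \langle p_2^{-1}[p_3] * \langle x_3-x_2\rangle_{p_3}\rangle_{p_3} \,] \,\rangle_{p_1}$

Here $p_i^{-1}[p_j]$ denotes the multiplicative inverse of $p_i$ modulo $p_j$.
More specifically, for the choice of moduli $P=\{2^n-1,\ 2^n,\ 2^n+1\}$, it follows that

$p_2^{-1}[p_1]=1$;  $p_2^{-1}[p_3]=2^n$;  $p_3^{-1}[p_1]=2^{n-1}$    (25)

Upon substituting these multiplicative inverses into equation 24, one obtains

$X_1 = x_2$
$X_2 = \langle\, 2^n * \langle x_3-x_2\rangle_{2^n+1} \,\rangle_{2^n+1} = \langle -(x_3-x_2)\rangle_{2^n+1}$    (26)
$X_3 = \langle\, 2^{n-1} * [\, \langle x_1-x_2\rangle_{2^n-1} - \langle x_2-x_3\rangle_{2^n+1} \,] \,\rangle_{2^n-1}$

Functionally, it can be seen that

$X_1=f_1(x_2)$;  $X_2=f_2(x_2,x_3)$;  $X_3=f_3(x_1,x_2,x_3)$    (27)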
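
As a numerical check of the mixed radix relations for the moduli set {2^n-1, 2^n, 2^n+1}, the following Python sketch computes the three digits from the residues and rebuilds x with equation 23. The function names `mrc_digits` and `mrc_to_integer` are hypothetical, and the moduli ordering p1=2^n-1, p2=2^n, p3=2^n+1 with n of at least 2 is assumed; the code is a sequential software model, not the table lookup realization discussed in the text.

```python
def mrc_digits(x1, x2, x3, n):
    """Mixed radix digits (X1, X2, X3) for p1=2**n-1, p2=2**n, p3=2**n+1."""
    p1, p3 = (1 << n) - 1, (1 << n) + 1
    X1 = x2
    # The inverse of p2 modulo p3 is 2**n, which is congruent to -1.
    X2 = (x2 - x3) % p3
    # The inverse of p2 modulo p1 is 1 and the inverse of p3 modulo p1 is 2**(n-1).
    X3 = ((1 << (n - 1)) * ((x1 - x2) % p1 - X2)) % p1
    return X1, X2, X3


def mrc_to_integer(digits, n):
    """Weighted sum x = X1 + X2*p2 + X3*p2*p3 (equation 23)."""
    X1, X2, X3 = digits
    p2, p3 = 1 << n, (1 << n) + 1
    return X1 + X2 * p2 + X3 * p2 * p3


# Example with n = 2, i.e. P = {3, 4, 5} and M = 60.
x = 37
digits = mrc_digits(x % 3, x % 4, x % 5, 2)
print(digits, mrc_to_integer(digits, 2))
```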

The MRC algorithm can of course be realized by using sequential methods. Here,
nested modulo adders and $p_i^{-1}[p_j]$ multipliers would be used to compute
$X_1, X_2, X_3$. The three-tuple of mixed radix digits would be used to compute
x (equation 23) using these multiplications and additions. The disadvantage
of the direct approach is execution speed due to the sequential complexity
of the algorithm. Throughput improvements and a reduction in complexity can
be achieved by using memory based table lookup operations to replace some
arithmetic. If high speed is to be achieved, high-speed memory units must be
employed. Such memories have a fairly restrictive input addressing space (5
to 12 bits). If mapping $f_3$ is to be realized by presenting all three
residues to a 4K-35ns RAM or ROM, then n=4, and $M<2^{12}$.

Consider again the three moduli case where $P=\{2^n+1,\ 2^n,\ 2^n-1\}=\{p_1,p_2,p_3\}$, which
specifies an RNS dynamic range $M=p_1p_2p_3$. Based on a 4K-word high-speed
memory model, the previous medium moduli RNS-to-decimal converter was
practically limited to a size of six bits per moduli (ie: $M\le 2^{18}$). The
method presented in this research targets a 12-bit moduli for practical use
(ie: $M\le 2^{36}$). The developed large moduli scheme can be easily motivated by
the data found in Figures 6a and 6b plus Table 7. The data found in these
figures and tables are based on the moduli set P={5,4,3} and M=60. The first
three entries found in Table 7, namely $x_3$, $x_2$ and $x_1$, are the residue digits
of x for x monotonically increasing over [0,59]. The fourth and fifth entries,
namely J1 and J3, are hybrid parameters. Since $|p_2-p_1|=|p_3-p_2|=1$, the values
of J1 and J3 will increase by unity (in a modulo $p_i$ sense) for monotonically
increasing values of x. The important observation is that J1 and J3 naturally
decompose into a system of cyclic patterns which shall be denoted $S_1^2$ and $S_3^2$
over a subcover of M, say $S_2$. More specifically,

$S_1^2$ = three sets of five subsets of four elements each and $O(S_1^2)=60$
$S_3^2$ = five sets of three subsets of four elements each and $O(S_3^2)=60$
$S_2 = \{kp_2 \mid 0\le k<p_1p_3\}$

In general, for $P=\{2^n+1,\ 2^n,\ 2^n-1\}=\{p_1,p_2,p_3\}$:

$S_1^2$ = $p_3$ sets of $p_1$ subsets of $p_2$ elements each and $O(S_1^2)=M=\prod p_i$
$S_3^2$ = $p_1$ sets of $p_3$ subsets of $p_2$ elements each and $O(S_3^2)=M=\prod p_i$

Using more traditional algebraic terminology, $S_1^2$, $S_3^2$ and $S_2$ are ideals
in the ring of integers modulo M (ie: $Z_M$). It is well known that in general
the mapping

$x \leftrightarrow (x_1,x_2,\ldots,x_L)$,  $x_i=\langle x\rangle_{p_i}$    (28)

is an onto homomorphism with kernel $\cap I_j$. For $I_j=\{kp_j\}$ (as is the case
here), $\cap I_j=0$ and Maher has shown the mapping to be isomorphic. It should
also be apparent that, due to the cyclic nature of J1 and J3, any $\bar{x}$
belonging to the subcover $S_2$ has the block representation

$\bar{x} = \{(p_1\,I1+J1)*p_2;\ 0\le I1<p_3;\ 0\le J1<p_1\} \in S_1^2$    (29)

or

$\bar{x} = \{(p_3\,I3+J3)*p_2;\ 0\le I3<p_1;\ 0\le J3<p_3\} \in S_3^2$    (30)

where I1, I3, J1, and J3 are integers. Equating equations 29 and 30, one
obtains

$p_3\,I3 - p_1\,I1 = (J1-J3)$    (31)

which is of the form ax+by=c. Equation 31 possesses a very important property
which will now be derived.

Lemma 1 [17]: If a, b, c are integers and at least one of a, b is nonzero,
set d=gcd(a,b); then a solution to the Diophantine equation

ax + by = c    (32)

exists for integer values of x and y if and only if d divides c.

Lemma 2: If b is relatively prime to a, then the congruence $bx \equiv c$
mod a has an integral solution x. Any two solutions $x_1$ and $x_2$ are congruent
modulo a.

From  these  two  lemmas,  the  following  theorem  can  be  stated: 

Theorem: Given the Diophantine equation 31

$p_3\,I3 - p_1\,I1 = (J1-J3) = c$  (see Figure 3b)    (33)

the solution two-tuple (I1,I3) is unique.

The proof is straightforward and is based on the fact that $p_1$ and $p_3$
are relatively prime, $I1\in[0,p_3-1]$, and $I3\in[0,p_1-1]$. Therefore, by specifying
c, the block indices (I1,I3) can be uniquely determined. Observe that $\bar{x}$
can be derived from knowledge of the two-tuple (I1,I3). However, (I1,I3) is
uniquely determined by c=J1-J3. Therefore, upon presenting an (n+1)-bit word
c to an (n+1)-bit high-speed RAM or ROM, the precomputed value of $S1=p_2p_1\,I1$
or $S3=p_2p_3\,I3$ can be outputted. The corresponding value of $\bar{x}\in S_2$ can be
realized by adding the integer $p_2\,J1$ to S1 or $p_2\,J3$ to S3. Lastly, if
$x\in[0,M)$, one only needs to add $x_2$ to $\bar{x}$. The decimal value x can therefore
be computed in the composite form

$x = x_2 + p_2\,J1 + I1\,p_1p_2$    (34)

where, due to uniqueness, the mixed radix digits are ($x_2$, J1, I1).

In general, for $P=\{2^n+1,\ 2^n,\ 2^n-1\}$, the routine would proceed as follows:

1. Accept $x \leftrightarrow (x_1,x_2,x_3)$

2. Form $J1=\langle x_2-x_1\rangle_{p_1}$ and $J3=\langle x_3-x_2\rangle_{p_3}$

3. Form J1-J3 = c

4. Map $\phi(c)=p_1p_2\,I1=S1$ or $p_3p_2\,I3=S3$

5. Form $\bar{x}=p_2\,J1+S1$ or $\bar{x}=p_2\,J3+S3$; $\bar{x}\in S_2$

6. Form $x=\bar{x}+x_2$

These steps are numerically exemplified in Table 7 and diagrammed in Figure 7
for the {5,4,3} system.
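
The six steps above can be sketched in software as follows. This Python model is an illustration under stated assumptions, not the architecture of Figure 7: the names `build_phi` and `rns_to_integer` are hypothetical, the lookup memory is modeled as a dictionary indexed by the signed value c, and only the S1 branch of steps 4 and 5 is used. The assertion inside `build_phi` checks, for the chosen n, that c alone determines I1 as the theorem states.

```python
def build_phi(n):
    """Precompute the table c -> S1 = p1*p2*I1 for P = {2**n+1, 2**n, 2**n-1}."""
    p1, p2, p3 = (1 << n) + 1, 1 << n, (1 << n) - 1
    phi = {}
    for k in range(p1 * p3):            # every element of S2 is k*p2
        i1, j1 = divmod(k, p1)          # k = p1*I1 + J1
        j3 = k % p3                     # k = p3*I3 + J3
        c = j1 - j3
        s1 = p1 * p2 * i1
        # Uniqueness: the same c must always select the same block index I1.
        assert phi.setdefault(c, s1) == s1
    return phi


def rns_to_integer(x1, x2, x3, n, phi):
    """Steps 1 to 6 of the routine, using the S1 branch."""
    p1, p2, p3 = (1 << n) + 1, 1 << n, (1 << n) - 1
    j1 = (x2 - x1) % p1                 # step 2
    j3 = (x3 - x2) % p3
    c = j1 - j3                         # step 3
    x_bar = p2 * j1 + phi[c]            # steps 4 and 5
    return x_bar + x2                   # step 6


# The {5,4,3} system (n = 2): x = 37 has residues (2, 1, 1).
phi = build_phi(2)
print(rns_to_integer(37 % 5, 37 % 4, 37 % 3, 2, phi))
```

Note that the conversion itself uses one table access, two small modular subtractions and ordinary additions, with no modulo M addition.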

Compared  to  conventional  RNS-to-decimal  conversion  algorithms,  the 
derived  algorithm  possesses  the  following  attributes: 

1.  no  modulo  M  addition  required  as  in  the  case  of  CRT  or  MRC  methods 

2.  practical  realization  of  very  large  moduli  RNS  systems 

3.  simple  architecture  and  reduced  complexity. 

Additional refinements in the proposed method can also be obtained. First,
observe that c, the hybrid parameter which defines the argument of the mapping
$\phi$ (item 4), is a signed integer such that $c\in[-(2^n-2),\ 2^n]$. Technically, to do
the mapping $\phi(c)$, an (n+1)-bit high-speed RAM or ROM would be needed. This
suggests that the largest admissible moduli is 11 bits (using a 4K-memory
model). Furthermore, since $c_{max}=2^n$, it would appear as though the output
register for the signed adder found at step 3 would have (n+2) bits if a
standard binary weighted code is to be used (eg: 2's complement). This problem
can be overcome through the following modifications.

1. Using an (n+1)-bit (at least) sign-magnitude adder, the sum c=J1-J3
can be represented as an (n+1)-bit word having the format MSB:xx...x:LSB.
The sum c can be partitioned into two sets Y and Z given by:
x∈Y if MSB of x = "0"
x∈Z if MSB of x = "1"

More specifically, Y is a set of $2^n$ elements determined by
Y = {y | y = x for $x\in[0,2^n-1]$}. Also, Z is a set of $2^n-1$ elements
determined by Z = {z | z=x if $x\in[-(2^n-2),-1]$, z=0 if $x=2^n$}. It can be
seen that the sets Y and Z are defined by the magnitude digits of
the signed magnitude value of c with the membership to Y and Z
determined by the MSB (sign-bit location) of c. The importance of
this partition is that two $2^n$ word tables can be used to map c into
$\phi(c)$. The device select line would be tied to the MSB of c as
suggested in Figure 8.

2. Another efficiency can be realized by using data packing. More
specifically, the term $p_2\,J1+x_2$, for the considered choice of moduli,
can be rewritten to read $2^n\,J1+x_2$. Since $0\le x_2<2^n$ and $0\le J1\le 2^n$,
the term $2^n\,J1+x_2$ can be directly and uniquely encoded into a (2n+1)-bit
register. This is suggested in Figure 8.

3. The proposed architecture, as in the direct realization of the
mixed radix conversion, requires modulo $p_i$ adders for $p_i=2^n-1$ or $2^n+1$.
Several such modulo adders have been reported in the open literature.
A very efficient 40nsec modulo $2^n+1$ and $2^n-1$ adder, for $n\le 12$, has
been reported in reference 18.

OVERFLOW  TOLERANT  RNS  MULTIPLIERS 

In order to extend the dynamic range of the autoscale multiplier to a
more useful size (say 12 to 16 bits), based on a 4K memory model, data
compression will be required. A suitable compression algorithm, based on the
quarter-square algorithm, has been reported in an earlier section. Furthermore,
the theoretical foundation of a compression scheme has been motivated
in the previous section. Here, data compression will be studied in the context
of the popular three-moduli system $P=\{2^n+1,\ 2^n,\ 2^n-1\}$ such that $M=p_1p_2p_3=2^{3n}-2^n$.
Any integer over [0,M) has the unique RNS representation $x\leftrightarrow(x_1,x_2,x_3)$.
Consider now a subcover of the range [0,M) generated by all numbers $\bar{x}$ having
an RNS representation $\bar{x}\leftrightarrow(\bar{x}_1,0,\bar{x}_3)$. Obviously $\bar{x}$ is defined over a
subcover of [0,M), say $S_2$, where $S_2=\{kp_2 \mid 0\le k<p_1p_3\}$. The utility of this
operation is that of data compression. More specifically, only 2n bits of
data are needed to uniquely quantify $\bar{x}$ (ie: $\bar{x}\leftrightarrow(\bar{x}_1,\bar{x}_3)$) versus 3n bits for
x (ie: $x\leftrightarrow(x_1,x_2,x_3)$). The digits of $\bar{x}$ can conveniently be defined to be:

$\bar{x}_1 = x_1-x_2$
$\bar{x}_2 = x_2-x_2 = 0$    (35)
$\bar{x}_3 = x_3-x_2$

As a point of interest, this is also an operation found in the residue to
mixed radix conversion algorithm used to determine the mixed radix coefficients
of the weighted representation

$x = X_1 + X_2\,p_2 + X_3\,p_2p_3$    (36)

ERROR  ANALYSIS 

The mean and error variance for the extended range autoscale multiplier
are a function of the chosen moduli set. Even the simplest analysis
becomes burdened with nested sums and binomial coefficients. Instead, the
error statistics of the multiplier were studied using numerical simulation.
A general purpose FORTRAN program, written on a PDP 11/60 under RSX-11, is
reported in Figure 9. In Figure 10, the product of x=16 and $y\in[0,29]$, for
P={3,4,5}, is reported. The parameter $Z_1$ is the autoscaled product over
$S_2$, $Z_2$ is the theoretical autoscaled product, with the last column exhibiting
the error. The test software operates in either a deterministic or statistical
mode. In either mode, the user specifies the choice of moduli (ie:
$P=\{p_1,p_2,p_3\}$) and the number of fractional bits used to define the table
lookup data. That is, the output wordlength is given by $[\log_2 M]$ + number of
fractional bits. In the deterministic mode, all possible values of x and y
over [0,N) are tested. However, if N is large, long execution delays can
result. To overcome this problem, a statistical approach may be used where
the integer values of x and y are randomly chosen from a uniformly distributed
process over [0,N). The test is repeated M times and the statistics analyzed.
The software presents to the user both error mean and error standard deviation.
For example, for P={7,8,9} and zero fractional bits of accuracy, the deterministic
error mean and standard deviation were determined to be e=-.00011476 and
$\sigma_e$=.00379'30. In the statistical mode, the results were e=-.00023643 and
$\sigma_e$=.00396498, which can be seen to be in close agreement.

[FIGURE 9: FORTRAN test program]

[FIGURE 10: multiplier operation; columns X, Y, Z1, Z2, ERROR]

Table 8 and Figure 11 summarize the results of several experiments. They are:

1.  Deterministic for P={3,4,5}

2.  Statistical for P={7,8,9}

3.  Statistical for P={15,16,17}

4.  Statistical for P={31,32,33}

for various choices of fractional-bit accuracy (denoted NM). The error
standard deviation data has been interpreted in graphical form in Figure 6
and compared to the usual theoretical model given by $\sigma_e^2=Q^2/12$ or $\sigma_e=Q/\sqrt{12}$.

Here  Q  is  the  quantization  step  size  which,  over  S?,  is  given  by  (H/p^. 
The data is shown to be in close agreement with the theoretical model.
Lastly, it can be observed that the multiplier's performance is more-or-less
invariant to the number of fractional bits used to generate the tables.

PART II SYSTEM DESIGN IN THE RNS

Under the AFOSR grant, new residue arithmetic units were conceptualized,
researched, and tested. Key breakthroughs were an efficient RNS to decimal
converter and an autoscale multiplier. In this section, these building blocks
will be put to use in designing high speed digital systems.

The utility of the RNS in digital filtering was forwarded by Jenkins and
Leon [1] through their work in non-recursive filtering (FIR: finite impulse
response). In this case the problem of register overflow, in the RNS, was
overcome through the use of a norm argument. Given a FIR filter satisfying
$y(n)=\sum_i a_i\,x(n-i)$, i=1,..., with $|x(i)|\le 1$, it follows that $|y(n)|\le\sum_i|a_i|=V$
over all i. In order to ensure that dynamic range overflow will not occur,
the RNS dynamic range $M=\prod p_i$ would be chosen so that M>V. However, the design
of recursive filters (IIR: infinite impulse response) is significantly more
complex. Soderstrand [4] approached the problem through base-extension methods.
Other authors have used the Chinese Remainder Theorem (CRT) or mixed radix
conversion algorithm to control dynamic range. This has been strongly
criticized because of the overhead normally associated with these operations.
The RNS concepts developed in Part I overcome many of these objections.

The classic digital filter architecture, often referred to as the
Jackson-Kaiser-McDonald (JKM) filter, realizes a filter in terms of general
purpose multipliers, adders, and shift-registers. In the mid-1970's, several
new memory-intensive linear shift-invariant digital filter architectures were
introduced. First, the distributed filter [Peled-Liu], or PL filter, was
introduced in 1974. Next, the Monkewich-Steenaart, or M-S, filter was
reported in 1975. All three architectures are summarized in Figure 12.
Compared to conventional architectures, this class of memory intensive filters
offers the potential for high throughput. Execution speed is achieved by
replacing the relatively slow process of general digital multiplication with
table lookup scaling operations. Jenkins and Leon, in 1977, studied a memory
intensive filter architecture based on residue [modular] arithmetic. By
exploiting the parallel nature of the residue numbering system, and using
table lookup operations to mechanize modular arithmetic, ultra-high speed
digital JKM filters were realized (see Figure 13). In most reported cases,
a residue arithmetic filter is organized into a decimal to residue encoder
stage, arithmetic-filter section, and residue to decimal converter stage.

In this work, the fundamental structure of the residue arithmetic-filter
section will be developed and new results presented.

[FIGURE 13: FIR RNS FILTER ARCHITECTURE; x(n) INPUT]

One of the principal limitations to the residue concept is its intolerance
of register overflow. This is a consequence of finite ring theory. Specifically,
for a set of relatively prime moduli, say $P=\{p_1,p_2,\ldots,p_L\}$, the residue
representation for a signed integer i is given by $i\leftrightarrow(i_1,i_2,\ldots,i_L)$ where

$i_k$ = i mod $p_k$         if $i\ge 0$
$i_k$ = (M-|i|) mod $p_k$   if $i<0$

and $0\le i_k<p_k$, $M=\prod p_k$, k=1,2,...,L. The integer i will have a unique
representation if and only if -M/2<i<M/2.
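
The signed encoding rule can be illustrated with a few lines of Python. This is a minimal sketch only: the function name `encode_signed` is an assumption introduced here, and the range test simply enforces the condition -M/2<i<M/2 stated above.

```python
from math import prod


def encode_signed(i, moduli):
    """Residue digits of a signed integer i with -M/2 < i < M/2."""
    M = prod(moduli)
    if not -M < 2 * i < M:
        raise ValueError("integer lies outside the unique range")
    # Negative integers are represented by their complement M - |i|.
    value = i if i >= 0 else M - abs(i)
    return tuple(value % p for p in moduli)


print(encode_signed(10, (3, 4, 5)))   # residues of 10
print(encode_signed(-1, (3, 4, 5)))   # residues of 60 - 1 = 59
```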

In order to ensure the satisfactory performance of a residue arithmetic
filter, dynamic range overflow cannot be tolerated. If, for example, a
shift-invariant filter of the form $y(k+1)=\sum a_i\,y(k-i)+\sum b_i\,x(k-i)$ is considered, the
$\ell_\infty$ bound on y(j) (ie: max |y(j)| for all j) must be less than M/2, otherwise
uniqueness cannot be guaranteed. As a result, the JKM residue arithmetic filter
suffered from a severe dynamic range restriction. For example, in order for
z=ax+by to be correctly represented in a residue system, z must be bounded by M.
For $0\le a\le A$, $0\le b\le B$, $0\le x\le X$, $0\le y\le Y$, then AX+BY<M. If A, B, X, Y are on the order of
r bits of precision, then M must be on the order of 2r+1 bits in range. However,
this is not the only constraint. If highspeed RAM or ROM is to be used
to perform the algebra (in a lookup sense), then for r=14, a table addressing
space of 29 bits would be required. Suppose, for the sake of discussion,
M≈33 bits and r≈16 bits. Using a 16K highspeed memory unit (30-50 nsec) as
a model (Footnote 1), the maximum value of a moduli $p_i$ is $2^7=128$. In order to achieve
the 33-bit system dynamic range (ie: M), at least 5 (ie: $\lceil 33/7\rceil$) moduli, on
the order of 7 bits each, must be used. This means that five parallel paths,
complete in memory and logic, must be configured and integrated into a complete
system.

Footnote 1: 16Kx1 units: INTEL 2167: access time=40ns, enable time=40ns,
cycle time=40ns, active power=500mW, standby power=75mW;
INMOS 1400: access time=30ns, enable time=35ns, cycle time=30ns,
active power=375mW, standby power=35mW.


M-S  RESIDUE  ARCHITECTURE  (MSR) 

The algebraic operations found in a linear shift invariant filter are
data delays, additions, and scaling (in lieu of general multiplication).
Replacing general multipliers with residue scalars has been proposed by
several authors.[1,2,14,18,19] A residue multiplier would present a 2r-bit
two tuple $(a_i,x_i)$ to a $2^{2r}$ word memory unit and respond with the precomputed
value of $(a_ix_i)$ mod $p_i$. Using the same $2^{2r}$-word memory unit, a scalar would
accept a 2r-bit representation of $x_i$. The table would respond with the
precomputed value of $(a_ix_i)$ mod $p_i$ where $a_i$ is known a priori. For example,
using three 16K NMOS 30 nsec static RAMs and three moduli of the form
$P=\{2^n-1,\ 2^n,\ 2^n+1\}$, a scalar having a dynamic range on the order of 42 bits
can be realized. Using such scalars, the M-S filter architecture found in
Figure 12 can be realized.
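
A table lookup scalar of this kind can be modeled in a few lines of Python. The sketch below is illustrative only: `build_scalar_table` and `scale` are hypothetical names, each list stands in for one memory unit addressed by a residue, and a small moduli set (n=3) is assumed to keep the tables short.

```python
def build_scalar_table(a, p):
    """Precompute (a*x) mod p for every residue x of modulus p."""
    return [(a * x) % p for x in range(p)]


def scale(x_residues, tables):
    """Scale an RNS operand by one table access per modulus."""
    return tuple(table[r] for r, table in zip(x_residues, tables))


moduli = (7, 8, 9)                      # P = {2^n-1, 2^n, 2^n+1} with n = 3
a = 5                                   # coefficient known a priori
tables = [build_scalar_table(a, p) for p in moduli]

x = 20
x_residues = tuple(x % p for p in moduli)
print(scale(x_residues, tables))        # residues of a*x = 100
```

Because the coefficient is folded into the table contents, the full address space of each memory is available to the operand residue alone.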
FINITE WORDLENGTH EFFECTS

Generally, a digital filter is a finite precision approximation to some
user defined discrete filter defined over a real coefficient field. The
errors due to finite wordlength effects are well documented. It is
generally assumed that the expected truncation error mean and variance, per
multiplier, are given by Q/2 and $Q^2/12$ respectively (Q represents the quantization
level). However, a residue arithmetic filter must be defined over a ring
of integers. Real numbers cannot be tolerated. For example, suppose
a=3.251 and x=10, then ax=32.51. Rounding this result would yield an
estimated product of 33. In a residue sense, with respect to a moduli set
P={3,4,5}; M=60, $x\leftrightarrow(x_1,x_2,x_3)=(1,2,0)$, one may make two sets of
calculations, namely (i) and (ii).


i)                                              ii)
a = 3.251                                       [a] = 3
ax = 32.51                                      ax = 30
<a x_1>_3 = <3.251>_3 = .251;  [.251] = 0        <a x_1>_3 = <3>_3 = 0
<a x_2>_4 = <6.502>_4 = 2.502; [2.502] = 3       <a x_2>_4 = <6>_4 = 2
<a x_3>_5 = <0>_5 = 0;         [0] = 0           <a x_3>_5 = <0>_5 = 0

The calculations found in column i use the decimal value of a in forming the
product (ax) modulo $p_i$. The resulting products are then rounded. The final
residue digits are (0,3,0), which is equivalent to a decimal value of 15. In
column ii, the integer value of a is used to form the product ax in the usual
residue arithmetic sense. The result is seen to produce the correct truncated
value of the product, namely (0,2,0) ↔ 30. Therefore, since all filter
parameters are to be integer valued over [0,M], traditional finite wordlength
error modeling and analysis techniques apply.
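
The two columns of the example can be reproduced numerically. The Python sketch below is a hedged illustration: the helper names `residues` and `decode` are assumptions, exact rational arithmetic is used for a=3.251 to avoid floating point effects, and the decoder is a brute force search over [0,M) rather than any converter described in this report.

```python
from fractions import Fraction

MODULI = (3, 4, 5)
M = 60


def residues(value):
    """Residue digits of an integer with respect to MODULI."""
    return tuple(value % p for p in MODULI)


def decode(digits):
    """Brute-force search for the integer in [0, M) with the given digits."""
    return next(v for v in range(M) if residues(v) == tuple(digits))


a = Fraction("3.251")
x_digits = residues(10)                          # (1, 2, 0)

# Column i: multiply by the real coefficient per modulus, then round.
col_i = tuple(round((a * xk) % p) % p for xk, p in zip(x_digits, MODULI))

# Column ii: truncate the coefficient first, then use integer residue arithmetic.
col_ii = tuple((int(a) * xk) % p for xk, p in zip(x_digits, MODULI))

print(col_i, decode(col_i))
print(col_ii, decode(col_ii))
```

Rounding inside each residue channel destroys the consistency between digits, which is why column i decodes to an unrelated value while column ii decodes to the truncated product.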


If a large dynamic range is required, in limited RNS hardware, magnitude
scaling is required. A similar strategy is used in designing filters using
weighted fixed point arithmetic where rounding or truncation is used to
control the growth of dynamic range. In an RNS system, the problem is compounded
by the fact that the magnitude of a number must be known, if it is
to be scaled, and magnitude determination in the RNS is difficult. That is,
in order to support scaling in the residue number system, some sort of
residue-to-decimal conversion is required. Most existing residue scaling routines
make use of base extension or mixed radix conversion schemes.[3] It has
been shown in reference [15] that in a realistic RNS system, a ten to
twenty-fold increase in computational overhead can be expected if scaling is present.
As a result, the overall throughput of an IIR-RNS filter would be compromised.

In order to achieve high data rates, over realistic dynamic ranges, in
limited hardware, a new low-overhead RNS arithmetic unit must be developed.
In the next section, such a unit will be presented.

M-K  RNS  FILTER  ARCHITECTURE 

In order to ensure the uniqueness of a modular product of two numbers of
dynamic range V, the modular dynamic range M must be bounded below by V^2
(i.e., V^2 <= M). This can be achieved through the use of a newly developed
auto-scale policy. The auto-scale arithmetic units will be shown to support
memory intensive filter architectures. Assume that there is a practical
memory word-size constraint. For high-speed (~30 ns) applications, memory
size is presently limited to 4K to 16K words (i.e., 12 to 14 address bits).

For reasons that will become self-evident later in this section, the
three moduli system, given by P={p1=2^n-1, p2=2^n, p3=2^n+1}, will be
considered.


Using such a moduli choice, signed integers x ∈ [-[M/2], [M/2]] are uniquely
represented by the three-tuple (x1,x2,x3) where x_i = x mod p_i. In order to
scale x, using memory table lookup operations, the magnitude of x must be
known. That is, the RNS three-tuple (x1,x2,x3) must be simultaneously
presented to a memory module which is programmed to output (cx) mod p_i,
where cx ∈ [-[M/2], [M/2]]. For high-speed applications, the limiting 16K =
2^14 memory units require p1 p2 p3 ~ 2^{3n} <= 2^14. That is, the design
would be constrained to consider moduli on the order of 4 bits each. Also,
the dynamic range is limited to 14 bits. Referring to Figure 7, it can be
observed that an integer x̂ ∈ [-[M/2], [M/2]], satisfying x̂ = k p2,
0 <= k <= p1 p3 - 1, has the unique RNS description

    x̂ ↔ ((x1-x2) mod p1, (x2-x2) mod p2, (x3-x2) mod p3)              (37)
      = ((x1-x2) mod p1, 0, (x3-x2) mod p3)
      = (x̂1, 0, x̂3)

Observe that x̂ is defined over a subcover of [-[M/2], [M/2]], and it can be
uniquely represented as the two-tuple (x̂1, x̂3). The two-tuple approximation
of x, namely x̂, establishes a memory size constraint given by 2^{2n} <= 2^14
or n <= 7. Now 7-bit (vs. 4-bit) moduli are admissible, with the dynamic
range extended to 3n or 21 bits. The memory units can be programmed to output
[cx̂] mod p_i where c is a user specified constant. Overflow prevention can
be achieved by introducing a scale factor K so that [cx̂/K]=[c'x̂] will not
exceed the permitted RNS dynamic range. For the application under study
(viz: IIR-RNS filtering), it shall be assumed that all system variables and
constants belong to the integer range [-M/2, M/2) where
M = p1 p2 p3 = 2^{3n} - 2^n.

Therefore, the scale factor K needed to ensure the absence of arithmetic
overflow is K=M/2.
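
The claim that x̂ can be addressed by a two-tuple can be illustrated with a small enumeration. This is only a sketch under stated assumptions: n = 3 is chosen for brevity, the non-negative range [0, M) is used instead of the signed range, and the function name x_hat_digits is hypothetical.

```python
n = 3
p1, p2, p3 = 2**n - 1, 2**n, 2**n + 1      # (7, 8, 9)
M = p1 * p2 * p3                            # 504


def x_hat_digits(x1, x2, x3):
    """Residue digits of x_hat = x - x2, formed digit by digit as in the three-tuple above."""
    return ((x1 - x2) % p1, 0, (x3 - x2) % p3)


# Every multiple of p2 in [0, M) should receive its own (x1_hat, x3_hat) address.
addresses = set()
for k in range(p1 * p3):
    x_hat = k * p2
    d = x_hat_digits(x_hat % p1, x_hat % p2, x_hat % p3)
    addresses.add((d[0], d[2]))
```

With n = 3 the set holds p1 p3 = 63 distinct addresses, so a table addressed by two n-bit (here one of them n+1-bit) digits suffices where the full three-tuple would need roughly 3n address bits.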

The error statistics associated with each auto-scaled multiplication can be
shown to be bounded in mean by p2 c/2M and in variance by σ^2 = p2^2/36M^2.
This has been experimentally verified. For example, for P = (15,16,17) and
(255,256,257), the error variance for the integer valued product y=cx, for
c ∈ [0,M] given and x ∈ [0,M] randomly chosen, is plotted in Figures 14 and
15. The error is defined to be e = (cx/M - [cx̂/M]) where x̂ = k p2,
k ∈ [0, p1 p3 - 1].

An M-S recursive RNS filter can be architected using the auto-scale
arithmetic unit. A dedicated auto-scale unit must be configured for each
unique filter coefficient. Each unit, in the three-moduli case, would
require three memory devices. For example, a 9 coefficient, 21-bit
resolution filter, based on 4Kx1 RAM/ROM, would require

    N = 9 (coefficients) x 3 (moduli) x 6 (wordlength per modulus)
      = 162 (16Kx1) memory units

A detailed description of an autoscaled arithmetic unit is found in Figure 16.
The modulo 2^n-1 and 2^n+1 adders can be realized in the manner suggested by
Taylor. This architecture uses PLA's to augment a conventional n-bit integer
adder. Other realizations have been reported in the open literature.[13]
For example, a modulo 2^n-1 adder can be realized in a simple straightforward
manner. Consider the term (x1-x2) mod 2^n-1 with x1 ∈ Z_{2^n-1} and
x2 ∈ Z_{2^n}. For (x1-x2) positive, (x1-x2) mod 2^n-1 = (x1-x2), but for
(x1-x2) negative, (x1-x2) mod 2^n-1 = |x1-x2|^c, where |x1-x2|^c denotes the
complement of the binary representation of |x1-x2|. For example,
-5 mod 7 = (101)^c = (010) = 2.

Footnote i: Unlike a 2's complement system, where partial sum overflow
can be tolerated if the complete sum of products is bounded,
each partial sum must be bounded to [-M/2, M/2) in the RNS.

If a real-time, or pipelined, architecture is required, then it is desirable
to design the modular adders to have identical propagation delays. Using the
PLA-supported architecture, modulo 2^n±1 adders can be realized in
commercially available hardware, for n<12, having a 30 nsec delay.
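
The complement rule for the modulo 2^n-1 subtraction can be sketched in software as follows. This is an illustrative model only, assuming x1 in Z_{2^n-1} and x2 in Z_{2^n} as in the text; the function name sub_mod_mersenne is hypothetical and the code says nothing about the PLA realization.

```python
def sub_mod_mersenne(x1, x2, n):
    """(x1 - x2) mod (2**n - 1) using an n-bit complement for negative differences."""
    mask = (1 << n) - 1            # n ones; numerically equal to the modulus 2**n - 1
    diff = x1 - x2
    if diff >= 0:
        return diff                # already a valid residue for the assumed input ranges
    return (~(-diff)) & mask       # bitwise complement of |x1 - x2| on n bits
```

For instance, sub_mod_mersenne(0, 5, 3) complements 101 to 010, reproducing the -5 mod 7 = 2 example.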

A second arithmetic operation found in the three moduli system, namely the
computation of v = (2^{n-1} A) mod 2^n-1, where
A = ((x1-x2) mod 2^n-1) - ((x2-x3) mod 2^n+1), can be simply computed. It is
directly verifiable that A ∈ [-2^n, 2^n-2]. Consider

    A = {  2A_1 + A_0    if A >= 0
        { -2A_1 - A_0    if A < 0                                     (38)

where A_0 ∈ Z_2 and A_1 ∈ Z_{2^{n-1}}. For A >= 0, v can be computed using
the following scheme:

    A:  a_{n-1} a_{n-2} ... a_1 a_0
    v:  a_0 a_{n-1} ... a_2 a_1

Example: A=6, n=3, (4·6) mod 7 = 3.

    A:  1 1 0
    v:  0 1 1

For A<0, v is given by [c denotes complement]

    |A|:  (s) a_{n-1} ... a_1 a_0        (s = phantom sign-bit)
    v:    (a_0 a_{n-1} ... a_2 a_1)^c

Example: A=-6, n=3, (-6·4) mod 7 = 4.

    |A|:  1 1 0
          0 1 1
    v:    1 0 0

As a result, v can be computed with negligible overhead and hardware, as
tabulated below for n=3.

    A    (2^{n-1} A) mod 2^n-1    A (binary)    A' (binary)    A' (decimal)
    6    <24>_7 = 3               110           011            3
    5    <20>_7 = 6               101           110            6
    4    <16>_7 = 2               100           010            2
    3    <12>_7 = 5               011           101            5
    2    <8>_7  = 1               010           001            1
    1    <4>_7  = 4               001           100            4
    0    <0>_7  = 0               000           000            0

The permutation discussed, from a modulo 2^n-1 adder to a buffer register,
can be realized by a hardware mapping. Here the LSB of the adder is
connected to the MSB of the buffer. The other (n-1) bits are mapped to the
buffer with a shift of one bit location.
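
The wiring permutation amounts to a one-bit right rotation, which can be modelled in a few lines. The sketch below is an assumption-laden software model (the function name scale_by_half and the folding of magnitudes of 2^n-1 and above are choices made here, standing in for the phantom sign-bit handling), not a description of the actual hardware.

```python
def scale_by_half(a, n):
    """Return (2**(n-1) * a) mod (2**n - 1) for a in [-2**n, 2**n - 2]."""
    mask = (1 << n) - 1
    m = abs(a)
    if m >= mask:                      # fold magnitudes 2**n - 1 and 2**n back into range
        m -= mask
    # Rotate right by one bit: the LSB moves to the MSB, the rest shift down.
    rot = (m >> 1) | ((m & 1) << (n - 1))
    if a >= 0:
        return rot
    # Negative A: n-bit complement of the rotated magnitude (all-ones folds to 0).
    return (mask - rot) % mask
```

For n = 3 this reproduces the tabulated values (6 maps to 3, 5 to 6, and so on) and the negative example, where -6 maps to 4.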

RNS  FILTER  DESIGN 
MK  Architecture 


From a design standpoint, it is desired to configure a system which has
minimum complexity. The 3!=6 possible permutations of a three moduli set are
summarized in Table 9 in general and for the specific example x=100 for
P=(7,8,9). A key feature of the general architecture is the set of
propagation delay paths d_T (total delay), d_1 and d_2 (see Figure 17). For
sequential operation, the MRC digits will be available for use after t_T
seconds. The length of delay is due primarily to the nesting of four
multipliers. In addition, t_T is a function of the multiplication philosophy
used (bit-serial, lookup, general purpose, etc.). If high throughput and low
complexity are desired, t_T should be minimized. One could also consider a
pipelined architecture of depth two having an effective throughput rate of
t_2 seconds per MRC. The design of an efficient pipelined MRC processor is
premised on the condition that t_T is small and t_1 and t_2 are comparable.
Referring to the data summarized in Table 9, it would appear as though the
first ordering admits the best design. Therefore, this ordering will be used
as the model throughout this section. Based on this model

    X1 = x2                                                           (40)
    X2 = (x2-x3) mod (2^n+1)
    X3 = (2^{n-1} {(x1-x2) mod (2^n-1) - (x2-x3) mod (2^n+1)}) mod (2^n-1)

The modulo 2^n±1 adders found in this architecture have been previously
discussed.
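
The three digit equations can be exercised on the x=100, P=(7,8,9) example. The following is a sketch under assumptions: the function names are hypothetical, and the reconstruction weights 2^n and 2^n(2^n+1) are those implied by the digit equations for this ordering rather than values quoted from Table 9.

```python
def mrc_digits(x1, x2, x3, n):
    """Mixed radix digits for moduli p1 = 2**n - 1, p2 = 2**n, p3 = 2**n + 1."""
    p1, p3 = 2**n - 1, 2**n + 1
    X1 = x2
    X2 = (x2 - x3) % p3
    A = (x1 - x2) % p1 - X2            # lies in [-2**n, 2**n - 2]
    X3 = (2**(n - 1) * A) % p1         # in hardware: the one-bit rotation
    return X1, X2, X3


def mrc_value(X1, X2, X3, n):
    """Rebuild the integer from its mixed radix digits (weights 1, p2, p2*p3)."""
    p2, p3 = 2**n, 2**n + 1
    return X1 + X2 * p2 + X3 * p2 * p3
```

For x = 100 the residues are (2, 4, 1) and the digits come out as (4, 3, 1), since 4 + 3·8 + 1·72 = 100.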

In a M-K architecture, each filter coefficient c_i is realized with a
dedicated RNS table lookup unit. Based on the three moduli MRC algorithm
and a given c_i, a MRC multiplier-scaler can be configured as suggested in
Figure 18. Here, for convenience, the multiplier 2^{n-1} is embedded into a
lookup table. Each unit would consist of nine 2^n x n-bit memory units. It
must be stated that in order to use a 2^n x n memory in a modulo (2^n+1)
operation, some form of data compression is required. The simplest
compression routine would differentiate between the two extremal numbers in
Z_{2^n+1}, namely 0 and 2^n. Those two numbers have an (n+1)-bit (i.e.,
common n-bit data bus plus 1-bit control line) representation of 00...00 and
10...00 respectively. By ANDing the n LSB's and sensing the MSB, the two
conditions can be easily identified.


For x=0, it follows that [cx/M]=0 and the output registers would be zeroed
without any memory (table lookup) action. This means that one of the 2^n
memory addresses, namely x=0, is superfluous and can be assigned to x=2^n.
It follows directly from the MRC representation, with MRC digits X1, X2 and
X3, that

    [x c_i/M] mod p_j = ( [X1 c_i/M] mod p_j + [X2 c_i p1/M] mod p_j
                          + [X3 c_i p1 p3/M] mod p_j ) mod p_j        (39)

That is, the outputs of the modular tables (viz: [X1 c_i/M] mod p_j, ...,
[X3 c_i p1 p3/M] mod p_j) must be recombined, in an additive modulo p_j
sense. Again, a sequential or pipelined architecture can be realized.
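
The address-reuse idea for the modulo 2^n+1 tables can be sketched as follows. This is a minimal illustration, assuming a generic table function f with f(0) = 0 (for example a constant multiplication modulo 2^n+1); the names build_table and lookup are hypothetical and the model ignores the width of the stored output word.

```python
def build_table(f, n):
    """2**n-entry table for operands in Z_{2**n + 1}; requires f(0) == 0."""
    size = 1 << n
    table = [f(addr) for addr in range(size)]
    table[0] = f(size)                 # address 0 is reused for the operand 2**n
    return table


def lookup(table, x, n):
    """Fetch f(x) for x in [0, 2**n] from the compressed table."""
    if x == 0:
        return 0                       # zero operand: clear the output, no memory access
    addr = x & ((1 << n) - 1)          # n LSBs; the operand 10...00 maps to address 0
    return table[addr]
```

As a usage example, with n = 4 and f(x) = (5x) mod 17, lookup returns the same value as f for every operand 0 through 16 while the table holds only 16 words.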

Example

Using  an  MS  architecture,  a  4th  order  Chebyshev  filter  was  realized. 

The  response  is  reported  in  Figure  19.  It  has  been  experimentally  determined 
that  it  requires  a  16-bit  moduli  three-tuple  to  achieve  satisfactory 
performance.

JKL-RNS Architecture

The disadvantage of the MRC-RNS-MK filter* is the rapid growth of
hardware as a function of filter order. In order to reduce the hardware
complexity associated with RNS-FIR filtering, reusable (undedicated)
overflow-free RNS arithmetic units may be considered. Again, if overflow
scaling is embedded in the table lookup operations, the magnitude of RNS
coded numbers must be known. As previously noted, this has been the
historical obstacle to IIR filtering in the RNS. The architecture which can
achieve this goal is detailed in Figure 18. The multiplier 2^{n-1}, as
previously noted, is a zero-overhead operation. A timing diagram is offered
in the referenced figure. It is assumed that the modular addition delays are
less than the lookup table access times (say t_M). The difficulty with this
proposed architecture is the delay associated with reprogramming the tables,
for each c_i, from high-density low-speed (comparatively) main memory or mass
storage. As a result, the overall throughput of this architecture will be
unattractive.

Distributed  RNS  Filter 

A powerful linear constant coefficient filter policy is distributed
arithmetic (alias: bit-serial, bit-slice, or Peled and Liu filter). In a
B-bit radix-2 binary weighted system, the output of a discrete filter,
satisfying

    y(n) = sum_{i=1}^{n} a_i y(n-i) + sum_{i=0}^{n} b_i x(n-i)        (41)

is given by

    y(n) = sum_{j=1}^{B-1} ( sum_{i=1}^{n} a_i y(n-i;j)
                             + sum_{i=0}^{n} b_i x(n-i;j) ) 2^{j-1}
           - ( sum_{i=1}^{n} a_i y(n-i;B)
               + sum_{i=0}^{n} b_i x(n-i;B) ) 2^{B-1}

with j denoting bit location. An equivalent statement can be made in the RNS
using the MRC. Here, a system variable would be given in MRC form as

    Z = Z1 + Z2 p1 + Z3 p1 p2

There is a minor structural difference between a distributed filter using a
MRC and one using a radix-2 format. For a three-moduli system, distributed
partial products must be recombined modulo p1, p2, and p3. Table
requirements, for this architecture, are correlated directly to n, the order
of the filter.
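
The radix-2 form of distributed arithmetic can be illustrated with a small integer model. This sketch is an assumption, not the report's design: samples are taken as B-bit two's complement integers with bit B-1 acting as the sign bit, the coefficients are integers, and the function names are hypothetical.

```python
def build_da_table(coeffs):
    """Entry at address b_0 b_1 ... holds the sum of coeffs[i] for which bit i is set."""
    K = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << K)]


def da_inner_product(coeffs, samples, B):
    """sum(c * s) for B-bit two's complement samples, one bit plane per table lookup."""
    table = build_da_table(coeffs)
    acc = 0
    for j in range(B):
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> j) & 1) << i      # bit j of sample i forms address bit i
        partial = table[addr]
        if j == B - 1:
            acc -= partial << j              # sign bit carries weight -2**(B-1)
        else:
            acc += partial << j
    return acc
```

The table grows as 2^K with the number of taps K, which is the software counterpart of the statement that table requirements follow the filter order.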

Bit  Slice 

Example: A 2nd order discrete Chebyshev filter was designed in the
usual way. The infinite precision response used double precision floating
point arithmetic. It can be noted, from the data displayed in Figure 20,
that a three 8-bit moduli filter performs better than a 12-bit fixed point
filter and closely approximates 16-bit precision. Using a 4th order model,
8-bit moduli can again be seen to offer acceptable performance (see Figure
21).

PART III APPLICATIONS

The RNS arithmetic, developed under this AFOSR grant, has been tested
in a MS, JKM, and distributed architecture. Several applications were
considered. These applications are, unto themselves, worthy of special
treatment. Therefore, they have been included in this report as appendices.
Appendix A treats the problem of realizing a real-time Kalman filter. The
developed filter (submitted for publication) represents an original approach
to this class of problem. In Appendix B, a linear adaptive noise canceller
is presented. Appendix C contains other papers published, or under review,
which contain an AFOSR credit line.

PART  IV  SUMMARY 

Under  an  AFOSR  grant,  RNS  arithmetic,  hardware,  and  architectures 
have  made  major  strides.  Using  a  three  moduli  system,  practical  ultra- 
high  speed  RNS  units  have  been  developed.  The  major  problem  of  RNS-to- 
decimal  conversion  plus  magnitude  scaling  has  been  successfully  treated. 

In  addition,  new  filter  architectures  were  derived  and  analyzed. 

The future of the RNS is considerably brighter as a result of this
study. In particular, the RNS techniques developed during this grant
period will be further enhanced with the advent of VLSI.

ABSTRACT:

An adaptive Kalman filter is considered for which the input noise
covariance and the output covariance are unknown. A new procedure is
presented for the identification of the unknown covariances. The proposed
identification algorithm uses the autocorrelation information contained in
the innovation error sequence to determine the ratio of the unknown noise
covariances and to obtain the optimal Kalman filter gain. The proposed
method can be easily implemented through the use of high speed digital
autocorrelation algorithms which operate at real time data rates.


1.0 INTRODUCTION

The Kalman-Bucy formulation of the minimal variance filtering problem
assumes complete a priori knowledge of the input and output noise
covariances, say Q and R. In most practical applications Q and R are either
assumed to be unknown or approximated. Several authors have presented
schemes for the identification of the unknown covariances with some success
(2), (6), (7) and (11), with the use of the innovation sequence in the
identification of the unknown covariances introduced by Mehra.

The identification scheme presented in this paper makes use of the first
delayed values of the autocorrelation function of the innovation sequence to
determine the ratio of the unknown covariances. The method presented is
applicable for those systems which exhibit a shift-invariant (constant
coefficient) property.

A high speed real-time algorithm for the computation of the
autocorrelation function has been presented by Pfeifer and Blankenship (12),
(13). The speed of the algorithm can be further improved through the use of
special codes (i.e., in-line, threaded, knotted) which make possible the
computation of the autocorrelation function in real time (14).


2.0 PFEIFER/BLANKENSHIP (PB) AUTOCORRELATION ALGORITHM

A discrete autocorrelation function r(k) is given by:

    r(k) = sum_{n=1}^{N} f(n) f(n+k)                                  (2.1)

If r(k) is desired for k on the order of N/2, then using FFTs to compute
DFT^{-1}(X(f)X*(f)) = r(k) is computationally optimal. A total of
N log2(N+k) complex multiplies are required (a complex multiply equals 4
real multiplies). Direct computation of the above requires N^2 real
multiplies.


The PB version of the autocorrelation algorithm satisfies:

    r(k) = sum_{j=0}^{P-1} sum_{i=1}^{k} f(2jk+i+k) (f(2jk+i) + f(2jk+i+2k))   (2.2)

For N>>k the PB algorithm essentially halves the number of multiplications
normally associated with direct computation. The multiplication count for
N>>k can also be considerably less than that associated with FFT
mechanizations.
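
The pairing of products that gives the PB form its saving can be sketched as below. This is an illustrative model only: it uses 0-based indexing (the text's index i runs from 1), assumes the record holds at least 2kP+k samples, and the function names are hypothetical.

```python
def autocorr_direct(f, k, count):
    """Sum of f[m] * f[m + k] for m = 0 .. count-1 (count multiplications)."""
    return sum(f[m] * f[m + k] for m in range(count))


def autocorr_pb(f, k, P):
    """The same sum for count = 2*k*P terms, using only k*P multiplications."""
    total = 0
    for j in range(P):
        for i in range(k):
            base = 2 * j * k + i
            # f[base+k] is shared by the lag products starting at base and at base+k.
            total += f[base + k] * (f[base] + f[base + 2 * k])
    return total
```

Each multiplication in autocorr_pb covers two terms of the direct sum, which is the halving referred to above; the extra work is one addition per pair.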


The speed of the algorithm can be further improved by eliminating the
time consumed to compute data invariant indices (i.e., (2jk+i+k), (2jk+i),
(2jk+i+2k)). This can be achieved through the use of so called in-line code.
In an in-line code, the code is arranged in a top-down fashion. Here entry
is made at the top and, without looping, the code runs to completion. The
desired in-line code, having all data invariant parameters, can be properly
computed and properly sequenced through the use of AUTOGEN methodology (14).

Another alternative is possible through the use of threaded code. It
replaces a standard program with a series of modules which are threaded
together. A thread is a precomputed data array in which all prerequisite
information is found. The array is serially scanned at run time and
therefore removes the overhead associated with the original computation of
the parameters. Compared to the in-line code, the threaded code is twice as
fast and occupies less memory (14).

Another option is available and is known as knotted code. Knots will be
tied in the thread, indicating that the program will move down the thread,
"run around" inside a knot for a while, then continue down the thread to the
next knot. The knot represents a subprogram in which there may exist
elementary looping. Using the knotted code, the memory requirements are
reduced considerably (14). It has been shown in reference 14 that a
120-point time series can be autocorrelated (up to a delay of 11 samples)
using a PDP 11 minicomputer in the following time intervals:

PDP  11/55  PDP  11/70*  pnp  1.1/r.O*  Pro-rain  Size  -5 


SO 
7552 
7  CO 


Convent i ona 1 

0.G9 

0.72 

17. 

35 

In-line 

5.25 

r}.’-\2 

in 

.03 

"not  ted 

C .  70 

7.05 

ll 

r- 

*  Cache  I'nchincs 
2  I -orris 


Realizing that the computation of the autocorrelation function can be
performed at real time speeds, we can proceed to the analysis of the problem.


3.0 STATEMENT OF THE PROBLEM

A discrete stochastic dynamic system can be represented as:

    x_{i+1} = F x_i + w_i                                             (3.1.a)
    z_i     = H x_i + v_i                                             (3.1.b)

where i = 0, 1, ...

    x_i = n x 1 state vector
    F   = n x n state transition matrix (constant)
    w_i = n x 1 vector of Gaussian input white noise
    z_i = r x 1 output vector
    H   = r x n output matrix (constant)
    v_i = r x 1 vector of Gaussian measurement errors (white)

It is assumed that:

    E(w_i) = 0,  E(w_i w_j^T) = Q δ_ij                                (3.2)
    E(v_i) = 0                                                        (3.3.a)
    E(v_i v_j^T) = R δ_ij                                             (3.3.b)

We consider the system to be a lowpass or bandpass filter and completely
controllable and observable.

The initial estimate of the initial state x̂_0 and the state covariance
PE_0(-) form the a priori statistics.

Given an estimation of the input noise covariance QI and output noise
covariance RI, an estimate of the state of the system defined by (3.1) is
obtained sequentially for k = 1, 2, ... with the standard Kalman filter:

    1. State Estimation Extrapolation:
           x̂_k(-) = F x̂_{k-1}(+)

    2. Error Covariance Extrapolation:
           PE_k(-) = F PE_{k-1}(+) F^T + QI_{k-1}

    3. State Estimation Update:
           x̂_k(+) = x̂_k(-) + K_k (z_k - H x̂_k(-))

    4. Error Covariance Update:
           PE_k(+) = (I - K_k H) PE_k(-)

    5. Kalman Gain Matrix:
           K_k = PE_k(-) H^T (H PE_k(-) H^T + RI_k)^{-1}

By definition:

    PT_k(-) = E(x̃_k(-) x̃_k(-)^T)


v.*’io  re 


"(in  ifoiiniWt  <iiM‘iiiiiinwi  'iH’tliiiiiW1  iiiiMMHiii'liihti 


I  •/.".«*.  r, 

that  the  optimal  steady  stat^  -a  i  n  can  ho  defined  in  terms 
of  the  ratio  of  tho  truo  statistics  (Tvl/at  )=Cn/°) .  In  this 
caso  PE(-)  is  directly  proportional  to  PTC-)  by  a  factor  A 
whore: 

A=  (r./ni)  =  (n/PI)  f;.5 
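
The dependence of the steady state gain on the ratio of the covariances alone can be checked with a scalar model. The sketch below is an assumption made for illustration (a first-order system with scalar F, H, Q and R, a fixed iteration count, and a hypothetical function name); it is not the fourth-order simulation reported in this paper.

```python
def steady_state(F, H, Q, R, iters=500):
    """Iterate the scalar covariance recursion; return (gain, a priori covariance)."""
    P_post = 1.0
    for _ in range(iters):
        P_prior = F * P_post * F + Q                 # error covariance extrapolation
        K = P_prior * H / (H * P_prior * H + R)      # Kalman gain
        P_post = (1.0 - K * H) * P_prior             # error covariance update
    return K, P_prior


K1, P1 = steady_state(0.9, 1.0, 0.5, 2.0)
K2, P2 = steady_state(0.9, 1.0, 0.5 * 4.0, 2.0 * 4.0)   # both covariances scaled by A = 4
```

Scaling Q and R by the same factor leaves the converged gain unchanged and scales the covariance by that factor, which is the proportionality expressed by A.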

By expanding the definition of PT(-) in (4.4) we get:

    PT(-) = F PT(+) F^T + Q                                           (4.6)
    PT(+) = (I - KH) PT(-) (I - KH)^T + K R K^T                       (4.7)

The following observations can be made regarding equations (4.6), (4.7)
and (4.5).

1. PT(-) depends explicitly upon the output a priori statistics of the
   system model (via the Kalman gain), and the true noise covariances R
   and Q.

2. PE(-) depends upon a priori statistics only.

Consider the case in which RI=R and QI>Q. It can be seen clearly from
equations (4.6) and (4.7) that PE(-)>PT(-). Consider also the alternative
case in which RI=R and Q>QI; it is clear that PT(-)>PE(-). These cases are
illustrated in Figures 1(a), 1(b) and 1(c). Here each iteration represents
the Kalman filtering of a 256 point time series with RI=R and QI changing.
The steady state covariances are plotted for each iteration with the ratio
of RI/QI distributed logarithmically from .01 to 10.


Observe that:

The crossover point of the curves is the optimal case in which we obtain
the optimal Kalman gain.

The theoretical covariance is increasing after the crossover point. It can
be explained by the Kalman filter gain, which is decreasing in each
iteration. Therefore, the innovation error is magnified.

It should be mentioned that in a real system PT(-) cannot be explicitly
determined.

At each iteration in the above simulation the autocorrelation function of
the innovation process was calculated directly. Observe that the second term
of the autocorrelation function (4.3) starts with negative values and
increases. When (RI/QI)=(R/Q), we have the optimal case in which
E(ν_k ν_{k-1}^T) = 0 as previously stated. This is illustrated by Figures
2(a), 2(b), 2(c). If (RI/QI)<(R/Q) the Kalman gain satisfies

    K (H PT(-) H^T + R) > PT(-) H^T

which makes E(ν_k ν_{k-1}^T) negative. In the other case, (RI/QI)>(R/Q), the
results are opposite.


v.-’ic  ro 


r,.a  it::::  Ai.,r>  piionjinunn 

The  al  ';or i  th:.i  is  ha  sod  upon  the  autocorrelation  func¬ 
tion  of  the  i nnovoi ion  process. 


In  the  first  iteration,  the  first  t P.r  of  the  auto¬ 
correlation  P(P)  :;i''er>  o  "ood  estimate  of  P  (•'*./)  especially 
•.:!>an  (P/'.)>1.  Poch  iteration  represents  the  1'nTmr.n  filter- 


1 

! 

in",  o 

-C  A 

r. 

Z::C>  point  time 

series  :.*?  th  a  constant  P-f  and 

var  ?- 

rhle  0 

i . 

Vs  in.*;  as  PI  th 

a  P ( 0 )  of  the  first  iteration. 

forty 

vrl  ues 

of 

ri'I  are  chosen 

so  IPX  P  l  /pf  5  >  «7. 1 .  After  e  ach 

i  tcr- 

at  i  on 

the 

first  del  eye-*! 

valued  of  the  a • r toco r r  c 1  a  t.  i  6  n 

func- 

j 

t  ?  on 

!  5 

calculated  as 

a  function  of  the  ratio  rv!  /3l  * 

h'heri 

the  process  *s  over,  s  curve  fitting  of  these  points  mi  cl  the 
order  of  the  appronfnatiqn  is  detonnined.  USirf-j  the  H'-i s&&r 
fionf  algorithm,  the  value  of  A  at  which  PClTat  can  be  da— 
t  ermine  *5.  The  i  teration-  number  nt  which  P (1 )  =h  can  be  de¬ 
termined  arid  then  P.  can-  be  ca Tc.y T a  to  d  from  ( . ;» )  since  P  ( '* ) 
arid  iiPZX-ddf  have  been  SZorgiT:  s$  fiiftjpnS  the  i-tcra'tisoris' 
i  rideiti-  ^nof/j  n4  ;  tiVee  rPtfo be:  e;5sf  ly  ehtfcufSf®£*: 

'hipV'Tbd.fci  rii^  the  op'bratfne;  ebr-’e  can:  h^  used  to  dor  i  no 


on  adaptive  po  l  i  cy .  /.n  nooptivc  r.ir.on  thin  con  uo  derive:! 

U'hicii/  depending  upon  t!va  current  location'  on  the  oparat :i nr. 
C!M*vo  f  ccn  converge  to  the  optimal  cose  one!  t i  vn  the  opt  imp! 
r.  to  o  c!  y  .  s  t  o  to  r  1  r.ia  n  -  a  I  n . 


7 .he  oho vo  al.~ori  thm  con  !>o  easily  ii nl  e:  -enter!  In  a  mi¬ 
croprocessor  due  to  it's  s  inpl  i  c?  ty.  Peal  time  th  rou.ehpui: 
con  be  ochived  usinc  the  7P  algorithm  to  calculate  the  first 
tv.'o  tenor.  of  the  on  toco  r  ro  lot  ion  function. 


t.*  . •-zmc.M..  ii:::.::PL.7 


von  r  th  order  elliptic  lo-.rpass  filter  s  r  i j  ,i  *  1  o  tec!  on 
IP  ?.!/:?.?■  di,"ito1  .ai  croccmputor.  The  octroi  valuer.  of  V- 


one!  !!  ere: 


-  (  n  pH"  e  oro 

V  V/«*  ->-•  .. 


■ no.  prdti'n-n-  VT.s  :!e  vo  1  oped  such  feet  the  true  noi so  co 


•ncas  con  ,?e  in;5;rtn\:  •••;  :.*i~  aussian  su  >ro:*t  me 


c  i  per.  a:.:  nr 


Vi.e  prr  ram  y  5  el  :h 


ra  1 1  o 


-  1 


p.  ■/•!:  > 


2(c) illustrate the results. The dots represent the actual values of P(1)
at each iteration. The coordinates of them are also presented. The curve
fitting of these points is represented by the continuous line. The unknown
noise statistics of the system, IN corresponding to Q and OUT corresponding
to R, are the values in the parentheses and their estimation is shown below
them. Results were good with minimal errors.
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use  of  n  Pel  nan  f  i 
first  end  second 
the-  innovation  prc< 
the  un!;ho::n  cov'd  ri ; 
erf.  iriplc-r.vcnt atr  on  of  the  .  a T.r or i  thm  is  possible  end  real 
■time  th roti f.h put  can  be  achived  wi  th  the  use  of  fast  auto¬ 
correlation  nlrorl  tons.  'i'nut.rer  icai.  ox  crop  1  e  ill  ust  rates  the 
r.  1  ror  i  £hr>.  '  - 

There are two minor problems with the algorithm.

When the output to input noise ratio is very small (less than .1),
relatively precise calculation of the actual noise statistics is difficult,
although

PLOT OF PT(-) AND PE(-) VERSUS ITERATIONS


APPENDIX  B 


ADAPTIVE  NOISE  CANCELLING 


Adaptive noise cancelling is a method of estimating signals corrupted by
additive noise or interference. This method uses a 'Primary' input
containing the corrupted signal and a 'Reference' input containing noise
which is similar to the primary noise. The reference input is adaptively
filtered and subtracted from the primary input to obtain the signal
estimate.


Adaptive filtering before subtraction allows the treatment of inputs that
are deterministic or stochastic, stationary or time variable. When the
reference input is free of signal and when certain other conditions are met,
the noise in the primary input can be eliminated without distortion. It is
further shown that in treating periodic interference, the adaptive noise
canceller acts as a notch filter with narrow bandwidth and the capability of
tracking the exact frequency of interference.


Noise cancelling is a variation of optimal filtering that is highly
advantageous in many applications. It makes use of a reference input derived
from one or more sensors located at points in the noise field where the
signal is very weak or undetectable. This input is filtered and subtracted
from the primary input containing both the signal and the noise. As a
result, the primary noise is attenuated or eliminated by cancellation.


If done improperly, subtracting noise from a received signal would result
in an increase in the output noise power. However, when the filtering and
subtraction are controlled by an appropriate adaptive process, noise
reduction, if not complete noise elimination, can be accomplished. Adaptive
filtering may not be applicable in all filtering situations. It would not be
possible when, for example, the reference noise input is unavailable. In
circumstances where adaptive noise cancelling is applicable, levels of noise
reduction are often attainable that would be difficult to achieve with
direct filtering.




In noise cancelling systems, the system output
z = s + n0 - y should be the best fit, in the least squares
sense, to the signal s. This is accomplished by feeding the
system output back to the adaptive filter and adjusting the
filter through an LMS adaptive algorithm to minimise the
total system output power. Thus the system output serves as
the error signal for the adaptive process.


THE  LMS  ADAPTIVE  FILTER: 


The LMS adaptive filter is the basic element of
the adaptive noise cancelling systems. The principal
component of adaptive systems is the adaptive linear combiner
shown in fig 1.1. It weights and sums a set
of input signals to form the output signal. The input
signal vector $X_j$ is defined as:

$$X_j = [\,x_{0j},\ x_{1j},\ \ldots,\ x_{nj}\,]^T$$

The input signal components are assumed to appear
simultaneously on all input lines at discrete times indexed
by j. The component $x_{0j}$ is a constant, normally set
to +1. The weight vector is:

$$W = [\,w_0,\ w_1,\ \ldots,\ w_n\,]^T$$

where $w_0$ is the bias weight. The output $y_j$ is
the inner product of W and $X_j$. That is:

$$y_j = X_j^T W = W^T X_j$$

The error e(j) is defined as the difference between
the desired response d(j) and the actual response y(j). In
the noise cancelling systems, d(j) is the primary input
itself.

$$e(j) = d(j) - X_j^T W = d(j) - W^T X_j$$
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The  adaptive  algorithm  has  to  adjust  the  weights 
of the adaptive linear combiner to minimise the mean
square error. The adaptive linear combiner along with the
tapped delay line forms the adaptive filter shown in fig
1.2. As before, the input signal vector is:

$$X_j = [\,x_j,\ x_{j-1},\ \ldots,\ x_{j-n+1}\,]^T$$

The components of this vector are delayed versions
of the input signal $x_j$. This filter permits the
adjustment of gain and phase at many frequencies simultaneously.
The total length of the delay line is determined by the
reciprocal of the desired filter frequency resolution.
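The tapped delay line filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's program: the function name `lms_filter`, the step size `mu`, and the two-tap test system are assumptions chosen for the example. The weight update used is the Widrow-Hoff rule W(j+1) = W(j) + 2 mu e(j) X(j), and no bias weight is included.

```python
import math

def lms_filter(x, d, n_taps, mu):
    """Tapped-delay-line LMS filter; returns (outputs, errors, final weights)."""
    w = [0.0] * n_taps
    taps = [0.0] * n_taps                  # X_j = [x_j, x_(j-1), ..., x_(j-n+1)]
    outputs, errors = [], []
    for xj, dj in zip(x, d):
        taps = [xj] + taps[:-1]            # shift the delay line by one sample
        y = sum(wi * xi for wi, xi in zip(w, taps))   # y_j = W^T X_j
        e = dj - y                                     # e_j = d_j - W^T X_j
        w = [wi + 2.0 * mu * e * xi for wi, xi in zip(w, taps)]
        outputs.append(y)
        errors.append(e)
    return outputs, errors, w

# Hypothetical test case: the desired response is a known two-tap
# combination of the input, so the weights should approach (0.7, -0.2).
x = [math.sin(0.3 * j) + 0.5 * math.sin(1.1 * j) for j in range(4000)]
d = [0.7 * x[j] - 0.2 * (x[j - 1] if j > 0 else 0.0) for j in range(4000)]
outputs, errors, w = lms_filter(x, d, n_taps=2, mu=0.02)
print(w)
```

With a small step size the weights settle slowly but smoothly; a larger `mu` speeds convergence at the risk of instability.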


ADAPTIVE  NOISE  CANCELLER  AS  A  NOTCH  FILTER: 


The notch filter is required in the situation
where the primary input is corrupted by an additive
undesired sinusoidal interference. A notch filter can easily
be realised by an adaptive noise canceller. The advantage is
that it offers easy control of bandwidth and the capability
of tracking the exact frequency of interference.


Fig 2 shows the single frequency noise canceller
with two adaptive weights. The primary input is assumed to
be any kind of signal: periodic, transient, stochastic,
or any combination of these. The reference input is assumed to
be a pure sine wave $C\cos(\omega_0 t + \phi)$. The primary and the
reference inputs are sampled every T seconds, that is, at a
sampling frequency of $2\pi/T$ rad/sec.
The reference input is also phase shifted by 90 deg and
again digitised.


Fig 3 is the flow diagram. It shows the operation
of the LMS algorithm. The weights are updated as shown in
the diagram by

$$w_1(j+1) = w_1(j) + 2\mu\, e(j)\, A(j)$$
$$w_2(j+1) = w_2(j) + 2\mu\, e(j)\, B(j)$$
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where  e(j)  is  the  error 
The reference inputs are:

$$A(j) = C\cos(\omega_0 jT + \phi)$$
$$B(j) = C\sin(\omega_0 jT + \phi)$$
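The two-weight canceller can be simulated directly from the update equations and the reference inputs A(j) and B(j). The sketch below is illustrative only: the function names, the 1000 Hz sampling rate, the step size, and the test signal (a 350 Hz tone plus 60 Hz interference) are assumptions for the example and do not reproduce the report's FORTRAN programs.

```python
import math

def notch_canceller(primary, f0, fs, mu, C=1.0, phi=0.0):
    """Two-weight LMS noise canceller; returns the error (system output) sequence."""
    w1 = w2 = 0.0
    out = []
    for j, d in enumerate(primary):
        a = C * math.cos(2.0 * math.pi * f0 * j / fs + phi)   # A(j)
        b = C * math.sin(2.0 * math.pi * f0 * j / fs + phi)   # B(j), 90 deg shifted
        y = w1 * a + w2 * b          # filter output
        e = d - y                    # canceller output, also the error signal
        w1 += 2.0 * mu * e * a       # w1(j+1) = w1(j) + 2 mu e(j) A(j)
        w2 += 2.0 * mu * e * b       # w2(j+1) = w2(j) + 2 mu e(j) B(j)
        out.append(e)
    return out

def tone_amplitude(sig, f, fs):
    """Amplitude of the component at frequency f, by correlation."""
    re = sum(v * math.cos(2.0 * math.pi * f * j / fs) for j, v in enumerate(sig))
    im = sum(v * math.sin(2.0 * math.pi * f * j / fs) for j, v in enumerate(sig))
    return 2.0 * math.hypot(re, im) / len(sig)

fs = 1000.0
primary = [math.sin(2.0 * math.pi * 350.0 * j / fs)
           + 3.0 * math.sin(2.0 * math.pi * 60.0 * j / fs + 0.4)
           for j in range(3000)]
cleaned = notch_canceller(primary, f0=60.0, fs=fs, mu=0.05)
tail = cleaned[-1000:]               # skip the adaptation transient
print(tone_amplitude(tail, 60.0, fs), tone_amplitude(tail, 350.0, fs))
```

After the transient, the 60 Hz component is removed while the 350 Hz component passes nearly unchanged, which is the notch behaviour derived in the text that follows.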


The isolated impulse response from the error e(j)
to the filter output is obtained with the feedback loop from
point 'D' to point 'B' being assumed to be broken.


Let an impulse of amplitude 'a' be applied at the
point of error signal, that is at point 'C', at a discrete
time j=k.


That is: at j = k, e(j) = a;

i.e. $e(j) = a\,\delta(j-k)$,

where $\delta(i) = 1$ for $i = 0$ and $\delta(i) = 0$ for $i \ne 0$.

The response at point 'E' is then:

$$e(j)A(j) = a\,C\cos(\omega_0 kT + \phi) \ \text{for } j = k, \qquad = 0 \ \text{for } j \ne k$$

This is the input impulse scaled in amplitude by
the instantaneous value of A(j) at j = k.


The signal flow path from the point 'E' to point
'F' is that of a digital integrator with transfer function
$2\mu/(z-1)$ and impulse response $2\mu\,U(j-1)$, where U(j)
is the discrete unit step function.


The response at point 'F' is obtained by convolving
$2\mu\,U(j-1)$ with e(j)A(j):

$$w_1(j) = 2\mu\, a\, C\cos(\omega_0 kT + \phi), \quad j \ge k+1 \qquad (5)$$

This step function, which was scaled at 'E' and
delayed at 'F', is now multiplied by A(j) to yield the response
at 'G' as

$$y_1(j) = 2\mu\, a\, C^2 \cos(\omega_0 jT + \phi)\cos(\omega_0 kT + \phi), \quad j \ge k+1 \qquad (6)$$

The response at point 'K' is obtained similarly. The
signal flow path from point 'H' to point 'I' has an
impulse response of $2\mu\,U(j-1)$, with U(j) being the step
function. The response at point 'I' is then

$$w_2(j) = 2\mu\, a\, C\sin(\omega_0 kT + \phi), \quad j \ge k+1 \qquad (7)$$

This again is multiplied by B(j) to obtain the response at 'K'
as

$$y_2(j) = 2\mu\, a\, C^2 \sin(\omega_0 jT + \phi)\sin(\omega_0 kT + \phi), \quad j \ge k+1 \qquad (8)$$

The combination of $y_1(j)$ and $y_2(j)$ yields the response at
the filter output point 'D' as

$$y(j) = 2\mu\, a\, C^2 \cos \omega_0 T(j-k), \quad j \ge k+1 \qquad (9)$$

$$\phantom{y(j)} = 2\mu\, a\, C^2\, U(j-k-1)\cos \omega_0 T(j-k) \qquad (10)$$
This is a time invariant impulse response which is
proportional to the input impulse. The linear transfer
function for the noise canceller can now be derived as
follows. If the time k is set to zero, the unit impulse
response from the point 'C' to 'D' is

$$y(j) = 2\mu C^2\, U(j-1)\cos(\omega_0 jT)$$

The transfer function of this path is

$$G(z) = 2\mu C^2\left[\frac{z(z-\cos\omega_0 T)}{z^2 - 2z\cos\omega_0 T + 1} - 1\right]
      = 2\mu C^2\,\frac{z\cos\omega_0 T - 1}{z^2 - 2z\cos\omega_0 T + 1} \qquad (11)$$

This function can be expressed in terms of the radian sampling
frequency $\Omega = 2\pi/T$ as

$$G(z) = 2\mu C^2\,\frac{z\cos(2\pi\omega_0\Omega^{-1}) - 1}{z^2 - 2z\cos(2\pi\omega_0\Omega^{-1}) + 1}$$

If the feedback path from 'D' to 'B' is now closed, the
transfer function H(z) from the primary input to the canceller
output is obtained from the feedback formula 1/(1+G) as

$$H(z) = \frac{z^2 - 2z\cos(2\pi\omega_0\Omega^{-1}) + 1}{z^2 - 2(1-\mu C^2)\,z\cos(2\pi\omega_0\Omega^{-1}) + 1 - 2\mu C^2} \qquad (15)$$

Equation 15 shows that the single frequency noise canceller
has the properties of a notch filter at the reference
frequency $\omega_0$: the zeros of H(z) lie on the unit circle at
$z = e^{\pm j\omega_0 T}$. This is also shown experimentally.
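The notch property can be checked numerically by evaluating the closed-loop transfer function on the unit circle. The sketch below is an illustration under assumed values: the function names and the choices mu = 0.05, C = 1, T = 1 and the reference frequency are hypothetical. It uses the open-path function G(z) = 2 mu C^2 (z cos(w0 T) - 1)/(z^2 - 2z cos(w0 T) + 1) and the closed-loop form H = 1/(1 + G).

```python
import cmath
import math

def loop_gain(omega, omega0, mu, C, T=1.0):
    """Open-path transfer function G(z) evaluated at z = exp(j*omega*T)."""
    z = cmath.exp(1j * omega * T)
    c = math.cos(omega0 * T)
    return 2.0 * mu * C * C * (z * c - 1.0) / (z * z - 2.0 * z * c + 1.0)

def notch_response(omega, omega0, mu, C, T=1.0):
    """Closed-loop transfer function H(z) evaluated at z = exp(j*omega*T)."""
    z = cmath.exp(1j * omega * T)
    c = math.cos(omega0 * T)
    num = z * z - 2.0 * z * c + 1.0
    den = z * z - 2.0 * (1.0 - mu * C * C) * z * c + 1.0 - 2.0 * mu * C * C
    return num / den

omega0 = 2.0 * math.pi * 0.06       # reference frequency, radians per sample
mu, C = 0.05, 1.0

at_notch = abs(notch_response(omega0, omega0, mu, C))
away = abs(notch_response(2.2, omega0, mu, C))
print(at_notch, away)               # essentially zero at the notch, near 1 away from it
```

The depth of the null does not depend on `mu`; the step size only sets how narrow the notch is and how quickly the canceller settles.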


APPLICATIONS: 


There are a variety of practical applications, such
as the cancellation of noise in speech signals, cancellation
of antenna sidelobe interference, and cancellation of
several kinds of interference in electrocardiography. The
simulated experiments and their results will now be shown.
These indicate the use of the adaptive noise canceller
in various environments.
Filter 1 is the hypothetical case where various
fixed frequencies are present and the undesired frequency is
to be cancelled. Consider an input of fixed signals at
60 c/s, 350 c/s, 400 c/s and 450 c/s. If the 60 c/s signal is
to be eliminated, the program for Filter 1 is shown along
with the output plots for the filter.


Filter 2 is another form of notch filter. This
filter removes the 60 Hz signal from a primary input of varying
frequencies. Signals of varying frequencies in the range of 300
Hz and above are generated as
input to the filter. As before, the 60 Hz will be removed
adaptively. Filter 2 and its output plots are shown.

The adaptive filter is equally applicable to filtering
signals of varying frequencies. This is
shown by Filter 3. Here the primary input has the signal in
varying frequencies as before, and the lower frequencies in the
range of 40 Hz to 70 Hz are completely eliminated. The
response for Filter 3 is also shown.
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C  AA (  I  OCAfiOUH  t  )  ) 

CIO  !Jlfr  ire-..  II  )  AA  (  I  )  .  A  (  ’  ) 

io  con  1 1  mi.  if:: 

20  U-  l.lHl 

m:  vuivn 

l:NU 


C     ************************************************************
C     FILTER 1: THIS FILTER HAS FIXED PRIMARY INPUT FREQUENCIES
C     AT 60HZ, 350HZ, 400HZ AND 450HZ WHICH ARE ALL
C     SINUSOIDAL. THE 60HZ IS BEING CONSIDERED AS THE NOISE
C     FREQUENCY TO BE REMOVED ADAPTIVELY.
C     THE OUTPUT IS A PLOT OF POWER SPECTRUM IN DB.
C     THE FILTER I/P AND O/P PLOTS ARE EXACTLY IN SAME SCALE
C     ************************************************************
      COMPLEX P1,P2,X2,X3,X4,X6,X7,X9
      COMPLEX Y1(257),S(257),RF1(257),RF2(257),X1(257)
      T=0.
      P1=(0.,0.)
      T2=0.
      P2=(0.,0.)
      X2=(0.,0.)
      X3=(0.,0.)
      X4=(0.,0.)
      X6=(0.,0.)
      X7=(0.,0.)
      X9=(0.,0.)
      DO 9 I=1,257
      Y1(I)=(0.,0.)
      RF1(I)=(0.,0.)
      RF2(I)=(0.,0.)
      X1(I)=(0.,0.)
9     CONTINUE
      A=0.
      F=350.0
      DO 1 I=1,256
C     -----PRIMARY INPUT-----
      IF(I.GE.75) F=400.0
      IF(I.GE.175) F=450.0
      A=15.0*SIN(T)
      B=3.0*SIN(T2)
      P1=CMPLX(A,0.)
      P2=CMPLX(B,0.)
      S(I)=P1+P2
      T=T+2.*3.14*F/1000.0
      T2=T2+2.0*3.14*60.0/256.0
1     CONTINUE
      A=0.
C     -----REFERENCE INPUT-----
      T=0.
      DO 2 I=1,256
      A=2.0*SIN(T)
      RF1(I)=CMPLX(A,0.)
      T=T+2.0*3.14*60.0/256.0
2     CONTINUE
      T=0.
C     -----90 DEG PHASE SHIFTED REFERENCE INPUT-----
      A=0.
      DO 3 I=1,256
      A=2.0*COS(T)
      RF2(I)=CMPLX(A,0.)


Filter 1: contd

      T=T+2.0*3.14*60.0/256.0
3     CONTINUE
C     -----THE FILTER-----
      Y1(1)=(0.,0.)
      DO 5 I=1,256
      X1(I)=S(I)-Y1(I)
      X2=X1(I)*RF1(I)
      X2=X2*0.125
      X6=X7
      X7=X2+X6
      X2=X7*RF1(I)
      X3=X1(I)*RF2(I)
      X3=X3*0.125
      X9=X4
      X4=X3+X9
      X3=X4*RF2(I)
      Y1(I+1)=X2+X3
5     CONTINUE
      IM=8
      CALL FFT(X1,IM)
      CALL FFT(Y1,IM)
      CALL FFT(S,IM)
C     S(I) PRESENTLY HAS THE FFT OF I/P S(I)
      DO 50 I=1,256
      GG=REAL(X1(I))**2+AIMAG(X1(I))**2
      GG=GG/1000.0
      X1(I)=CMPLX(GG,0.0)
      GK=REAL(Y1(I))**2+AIMAG(Y1(I))**2
      GK=GK/1000.0
      Y1(I)=CMPLX(GK,0.)
      PP=REAL(S(I))**2+AIMAG(S(I))**2
      PP=PP/1000.0
      S(I)=CMPLX(PP,0.)
50    CONTINUE
      CALL INITT(120)
      CALL PLOTS(IBUF,1,5)

^.L,L  ™  *<*;*-.'*  AX  I  $  '  ,-G,  10 .  „0 .  ,0  1  ' 
FORMAT { [  l  J  0  * Y  AX  ,s  1  /  6'  8  •  ,80.  ,0 .  S .  y 
      WRITE (1,660)
660   FORMAT(' ','TO START THE PLOT, HIT RETURN ')
      X=0.
      CALL PLOT (0.0,0.0,-3)
      DO 303 I=1,128
      Y=REAL(S(I))
      X=X+6
      IC=2
      CALL FACTOR(0.01)
      CALL PLOT (X,Y,IC)
303   CONTINUE
C     P.S. OF O/P (GG) IS PLOTTED


Filter 1: contd

      X=0.
      READ (1,558) IZK

CALL  AXIS(0.,0.,,’POW  SPEC’  8 


      CALL PLOT (0.0,0.0,-3)
      DO 90 I=1,128
      Y=REAL(X1(I))
      IC=2
      X=X+6
      CALL FACTOR (0.01)
      CALL PLOT (X,Y,IC)
90    CONTINUE
      CALL ANMODE
      CALL FINITT(0,0)
      STOP
      END


,0.,.0.,.l.). 

A  A. 


SPECTRUM  OF  FILTER  INPUT  &  OUTPUT  IN  DB 


C     FILTER 2: THIS FILTER HAS A PRIMARY INPUT OF VARYING
C     FREQUENCIES IN THE RANGE OF 40C/S TO 70C/S AND
C     300 C/S TO 800C/S. THE FREQUENCY OF THE GENERATED SIGNAL
C     IS CONTINUOUSLY VARYING SINUSOIDALLY.
C     THE AMPLITUDE VARIATION OF THE SIGNAL IS ALSO SINUSOIDAL
C     60C/S IS ASSUMED TO BE THE NOISE FREQ TO BE ELIMINATED
C     THE OUTPUT IS A PLOT OF POWER SPECTRUM IN DB.
C     THE I/P AND O/P PLOTS ARE IN THE SAME SCALE
C     ************************************************************
      COMPLEX P1,P2,X2,X3,X4,X6,X7,X9
      COMPLEX Y1(257),S(257),RF1(257),RF2(257),X1(257)
      T=0.
      P1=(0.,0.)
      T2=0.
      P2=(0.,0.)
      X2=(0.,0.)
      X3=(0.,0.)
      X4=(0.,0.)
      X6=(0.,0.)
      X7=(0.,0.)
      X9=(0.,0.)
      DO 9 I=1,257
      Y1(I)=(0.,0.)
      RF1(I)=(0.,0.)
      RF2(I)=(0.,0.)
      X1(I)=(0.,0.)
9     CONTINUE
      A=0.
      U=0.
      GA=0.0
      U2=0.0
      GA2=0.0
C     -----PRIMARY INPUT (NOISE CORRUPTED)-----
      DO 1 I=1,256
      U=U+4.0
      GA=U/100
      F=400.0+(SIN(GA)*100.0)
      U2=U2+4.0
      GA2=U2/100.0
      F2=60.0+(SIN(GA)*10.0)
      A=5.0*SIN(T)
      B=3.0*SIN(T2)
      P1=CMPLX(A,0.)
      P2=CMPLX(B,0.)
      S(I)=P1+P2
      T=T+2.*3.14*F/1000.0
      T2=T2+2.0*3.14*F2/256.0
1     CONTINUE
C     -----REFERENCE INPUT-----
      A=0.
      T=0.


Filter 2: contd

      DO 2 I=1,256
      A=2.0*SIN(T)
      RF1(I)=CMPLX(A,0.)
      T=T+2.0*3.14*60.0/256.0
2     CONTINUE
C     -----PHASE SHIFTED REFERENCE INPUT-----
      T=0.
      A=0.
      DO 3 I=1,256
      A=2.0*COS(T)
      RF2(I)=CMPLX(A,0.)
      T=T+2.0*3.14*60.0/256.0
3     CONTINUE
C     -----THE FILTER-----
      Y1(1)=(0.,0.)
      DO 5 I=1,256
      X1(I)=S(I)-Y1(I)
      X2=X1(I)*RF1(I)
      X2=X2*0.125
      X6=X7
      X7=X2+X6
      X2=X7*RF1(I)
      X3=X1(I)*RF2(I)
      X3=X3*0.125
      X9=X4
      X4=X3+X9
      X3=X4*RF2(I)
      Y1(I+1)=X2+X3
5     CONTINUE
      IM=8
      CALL FFT(X1,IM)
      CALL FFT(S,IM)
C     S(I) PRESENTLY HAS THE FFT OF I/P S(I)
      DO 50 I=1,128
      GG=REAL(X1(I))**2+AIMAG(X1(I))**2
      PP=REAL(S(I))**2+AIMAG(S(I))**2
520   FORMAT(' ','INPUT=',4X,E14.5,10X,'OUTPUT=',4X,E14.5)
      S(I)=CMPLX(PP,0.)
      X1(I)=CMPLX(GG,0.0)
50    CONTINUE
      CALL INITT(120)
      CALL PLOTS(IBUF,1,5)
      CALL PLOT (1.0,1.0,-3)
      WRITE(1,660)
      READ(1,558) IED
      CALL AXIS(0.,0.,'INPUT POW SPEC IN DB',-20,9.,0.,0.,100.)

CALI  AXISCO. .J>., -100. 

558   FORMAT(I4)
660   FORMAT(' ','TO START THE PLOT, HIT RETURN ')
      X=0.
      CALL PLOT (1.0,1.0,-3)
      IC=2
      DO 303 I=1,128


Filter 2: contd

      PP=REAL(S(I))
      PP=20.0*LOG10(PP)
      Y=PP
      X=X+3
      CALL FACTOR(0.02)
      CALL PLOT (X,Y,IC)
303   CONTINUE
C     P.S. OF O/P (G(I)) IS PLOTTED
      X=0.
      READ (1,458) IZK
458   FORMAT(I4)
C     CALL PLOT (1.0,1.0,-3)
      IC=2
      CALL AXIS(0.,0.,'POW SPEC IN DB',-14,9.,0.,0.,1.)
      CALL AXIS(0.,0.,' ',1,7.,90.,0.,1.)
      CALL PLOT (1.0,1.0,-3)
      DO 90 I=1,128
      GG=REAL(X1(I))
      GG=20.0*LOG10(GG)
      Y=GG
      X=X+3
      CALL FACTOR (0.02)
      CALL PLOT (X,Y,IC)
90    CONTINUE
      CALL ANMODE
      CALL FINITT(0,0)
      STOP
      END


Appendix  C 


1.  A VLSI Residue Arithmetic Multiplier, IEEE Transactions on Circuits
and Systems, W.K. Jenkins Editor, revised paper accepted for
publication, approx. 10 pages.

2.  Large Multiplier Multipliers, ICASSP 80 Proceedings, Denver,
Colorado, April 8-11, 4 pages.

3.  Large  Moduli  Multipliers,  1980  International  Symposium  on 
Circuits  and  Systems,  Houston,  Texas,  April  28-30,  3  pages. 

4.  A  New  Technique  For  WFTA  Input/Output  Reordering,  International 
Journal  of  Computer  and  Information  Sciences,  J.  Tou  editor, 
accepted  for  Vol.  10,  Number  1,  approx.  15  pages. 

5.  The  Realization  of  Adaptive  Kalman  Filter,  pending  ACTA,  M.  Hamza 
Editor. 
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ABSTRACT- 

In this work we present a new table look-up
storage scheme and a class of table look-up
multipliers capable of working with exact
(modular) numbering systems.

Memory savings associated with the new look-up
multiplier, when compared to contemporary
methods, are shown to be on the order of 2/M,
where M=2^n, n=input wordlength. Throughput is
shown to be equal to that obtained using VLSI
and classic architectures.


Introduction: 

Digital signal processing is a study undergoing
accelerated growth, acceptance, and application.

With the possible exception of number theoretic
transforms, digital signal processing has been
principally advanced through technological
achievements. These include the microprocessor,
low cost high performance memory, and the read
only memory (ROM). The availability of the ROM
has challenged our traditional attitude towards
performing digital arithmetic. In particular,
the art of fixed point multiplication has
undergone a partial metamorphosis through the use of
ROM based table look-ups. Since multiplication
has been a principal speed, cost, and complexity
limitation to digital filtering, advancements
in this area have been warmly received.

EXISTING  LOOK -UP  ARITHMETIC  TECHNIQUES: 

Much of the reported work on ROM based fixed
point multiplication has been in support of
linear shift invariant digital filtering. Authors
such as Jenkins and Leon, Soderstrand, etc., have
studied the cost-speed metrics of digital filters
using the residue numbering system. The principal
advantage of the residue numbering system is that
it supports fixed point multiplication and addition
without need of preserving "carry information".
Thus, parallel operations are admissible. In
addition, modular multipliers were shown to be
realizable using table look-up methods and ROM's.

Jenkins recently questioned whether the
performance of the residue numbering system was due to
the intrinsic properties of the system or the use
of look-up multipliers. It was concluded that "it
appears that when no rounding (scaling) is
required, the residue structure always provides
better performance with regard to multiplication;
when all multiplications must be rounded, the 2's
complement structure provides better performance."
Since most non-trivial filter and transform
applications require a high plurality of multiply
and add operations (almost always insuring the
overflow of the limited integer dynamic ranges
Currently  being  implemented  (<-'°  tyo.))  the 
future of residue based digital systems may appear
limited at first glance. We shall, in this work,
present some new results which overcome this
contemporary deficiency and in fact make the
potential of modular arithmetic systems even more
exciting.

Residue  ALU’s 

The disadvantages of the residue number system
are manifold. Since the RNS possesses no significant
digit, decimal to residue conversion,
division, magnitude comparison, and arithmetic
shift operations are cumbersome and should be
avoided. Register overflow, due to its finite
dynamic range, imposes a severe constraint on the
RNS operations. Unlike weighted numbers (decimal,
binary, etc.) where rounding or truncating least
significant digits can control overflow, such is
not the case in the RNS. Since there is an
absence of least significant digits, the more
general and inefficient operation known as scaling
must be used. Since scaling is a form of division,
its use should be discouraged. To gain insight
into this problem, consider the inner product of
two 31-dimensional real vectors x and y whose
entries are encoded as residue digits with respect
to P={32,31,29,27}, for which M = 32*31*29*27 = 776736.
Without scaling, the dynamic
range of x and y would be limited to $V^{1/2}$ where
V = M/31 = 25056. Therefore, to insure that no
worst case overflow can occur, a 7.3-bit (ie:
$V^{1/2} = 158 \approx 2^{7.3}$) dynamic range limitation
must be imposed on x and y. With scaling, larger input
ranges can be used at the expense of statistical
accuracy in the output space (analogous to roundoff errors).
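The figures in this example can be reproduced with a few lines of arithmetic. The sketch below is illustrative; it assumes V is taken as M/31, which is the reading that reproduces the quoted value 25056, and the variable names are chosen for the example.

```python
import math

moduli = [32, 31, 29, 27]          # pairwise relatively prime moduli set P
M = math.prod(moduli)              # full dynamic range of the residue system
V = M // 31                        # headroom per term of a 31-term inner product
bound = math.isqrt(V)              # largest magnitude allowed for x and y entries
bits = math.log2(math.sqrt(V))     # the same limit expressed in bits

print(M, V, bound, round(bits, 1))
```

Each of the 31 products is at most V when both factors are at most the square root of V, so the accumulated sum cannot exceed M.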

Due to the limited dynamic range of RNS systems, one is
generally forced to accept one of the following
two overflow prevention strategies.

1.  Increase the dynamic range to a sufficiently
large value by adding more moduli to P, or

2.  Make scaling a more efficient operation.

The first notion represents a brute force approach
to the problem. Such an approach will increase the
cost and complexity metrics of a system. In
addition, the moduli set P must be tailored to a
unique filter. The other approach appears to be
the most popular at this time. Szabo and
Tanaka, and others, have concentrated on the
scaling efficiency through the choice of the
three-tuple moduli set $P = \{2^n-1,\ 2^n,\ 2^n+1\}$. This
moduli set has the ability to efficiently scale
a residue number by any one of the chosen moduli.
However, there is an intrinsic limitation
plaguing this method and it is its dynamic
range. Using a large high-speed memory unit,
say 4Kx1, the input addressing space is limited
to $2^{12}$. This means that a modulus is
technically limited to $p \le 2^6$ (ie: $x, y < 2^6$).
Therefore, the dynamic range of any modular
operation is given by $M = (2^n-1)(2^n)(2^n+1) \approx 2^{3n} = 2^{18}$.
In many applications, an 18-bit resolution is
insufficient resolution.

New Results

Two new memory efficient algorithms have been
derived and are based on a novel factorization
of a bilinear form. Over a real field it is
obvious that

$$xy = ((x+y)/2)^2 - ((x-y)/2)^2 \qquad (1)$$


and throughput (through the reduction or absence
of traditional scaling operations) can be
achieved. Finally, several versions of the $M^4$
algorithm can be considered. They are summarized
in figure 3.

Upon closer investigation of the table look-up
data base, a potential nuisance can be found. It
can be exemplified by observing that if $s^- = 9$,
p=32, then $\phi(s^-) = \langle 9^2/4 \rangle_{32} = 20.25$. Therefore, it
may appear that two additional fractional
bits need to be added to the table's output
word length. However, this is not the case, as
suggested by the following theorem:

Theorem:  Let [v] denote the integer value of v.

Then $z = \langle\, [\phi(s^+)] - [\phi(s^-)] \,\rangle_p$.

That is, only the integer value of $\phi$ need be used
and the fractional bits of $\phi(s^\pm)$ can be ignored.

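Proof:  Let $(x+y)/2 = v + k/2$ and $(x-y)/2 = q + k/2$, where
v and q are integers and k = 0 or 1 (x+y and x-y always
have the same parity). Then

$$z = \langle (x+y)^2/4 \rangle_p - \langle (x-y)^2/4 \rangle_p
    = \langle v^2 + vk + k^2/4 \rangle_p - \langle q^2 + qk + k^2/4 \rangle_p$$

$$\phantom{z} = \langle v^2 + vk \rangle_p + k^2/4 - \langle q^2 + qk \rangle_p - k^2/4
    = [\phi(s^+)] - [\phi(s^-)] \pmod p$$

Equation 1, in modular form, becomes

$$\langle xy \rangle_p = \langle\, \phi(s^+) - \phi(s^-) \,\rangle_p \qquad (2)$$

where $\phi(s) = \langle s^2/4 \rangle_p$ with $s^+ = x+y$ and $s^- = x-y$.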

At first glance this algorithm, which shall be
referred to as a minimal memory modular
multiplier ($M^4$), would appear to be counterproductive
with respect to a memory conservation metric.

The memory requirements associated with the $M^4$
will be shown to be substantially less than
those of direct mechanizations. First, it should
be apparent that the integers $s^+$ and $s^-$ found
in equation 2 are bounded from above by $2^{n+1}$.
Therefore, only a (n+1)-bit table addressing
space is required to realize $\phi(s^\pm)$ versus the
2n-bit space needed for direct architectures.
It would appear, however, that there is an
exception to this rule, since one of the moduli
chosen is $p = 2^n+1$. Here the maximal value of
$s^+$ (or $s^-$) is $2^{n+1}$, which would technically
require a (n+2)-bit address. However, by using
the protocol found in figure 1, the table size
can be reduced to $2^{n+1}$ words for all moduli.
Here, the overflow bit serves to differentiate
$s^\pm = 0$ from $2^{n+1}$.
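The table look-up multiplication can be illustrated with a short sketch. The names `build_table` and `modmul` are hypothetical, and the code makes two simplifying assumptions: the table stores the integer part of s^2/4 reduced modulo p, and it is given one extra entry for the top address 2^(n+1) rather than using the overflow-bit protocol of figure 1.

```python
def build_table(n, p):
    """Integer part of s^2/4, reduced mod p, for every address s = 0 .. 2^(n+1)."""
    return [((s * s) // 4) % p for s in range(2 ** (n + 1) + 1)]

def modmul(x, y, p, table):
    """<xy>_p from two look-ups and one modular subtraction."""
    s_plus = x + y
    s_minus = abs(x - y)            # the square does not depend on the sign
    return (table[s_plus] - table[s_minus]) % p

n = 4
results = {}
for p in (2 ** n - 1, 2 ** n, 2 ** n + 1):
    table = build_table(n, p)
    results[p] = all(modmul(x, y, p, table) == (x * y) % p
                     for x in range(p) for y in range(p))
print(results)
```

Because x+y and x-y have the same parity, the discarded quarter fractions cancel, so the integer-only table gives the exact product. A direct product table would need an address for every (x, y) pair, against roughly 2^(n+1) words here.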


The $M^4$ system architecture is abstracted in
Figure 2. This uses $2^{n+1}$-word high-speed memory
for modular arithmetic look-up operations. Using,
for example, the previously referenced 4K-30ns
device, moduli having an 11-bit dynamic range
(vs. 6-bit in the direct form) can be mechanized.
This would yield a three-moduli dynamic range on
the order of $2^{33} \approx 8.6 \times 10^9$. That is, without
an increase in memory size (and therefore access
time), the dynamic range of the $M^4$ is $2^{33}/2^{18} = 2^{15}$
times larger than that obtainable through direct
means. This large increase in dynamic range
makes the RNS a viable alternative to traditional
filter design methods. Both improved precision

As a result, the parallel architecture is
equivalent to that shown in figure 4.

Modulo  p  Adder 

The $M^4$ multiplier requires a modulo p adder be
used to combine the two component parts of the
solution (namely $\phi(s^+)$ and $\phi(s^-)$). Modulo p
adders pose an interesting design problem.

Unless a fast modulo p adder can be fabricated,
the overhead associated with addition will offset
any gain in throughput achieved through table
look-ups. For the moduli chosen, $2^n-1$, $2^n$, and
$2^n+1$, only the modulo $2^n$ adder can be realized
directly (n-bit adder with ignored overflow). It
would, however, be desirable to use a n-bit adder
to realize the modulo $2^n-1$ and $2^n+1$ adders as well.

Using n-bit AND gates to sense the zero condition
of the overflow bit OVF, and the sign bits
of $\phi(s^+)$ and $\phi(s^-)$, a combinational logic
routine can be defined which will convert the adder
output s into $\langle s \rangle_p$. It can be noted that the mapping

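requirements are:

1.  for $p = 2^n-1$, map s to s or $s - 2^n + 1 = \langle \langle s \rangle_{2^n} + 1 \rangle_{2^n}$,

2.  for $p = 2^n$, map s to $s - 2^n = \langle s \rangle_{2^n}$,

3.  for $p = 2^n+1$, map s to s or $s - 2^n - 1 = \langle \langle s \rangle_{2^n} - 1 \rangle_{2^n}$.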

Suppose the modulus $p = 2^n+1$, n = 12, is to be implemented.
By using two commercially available
16x9 PLA's in parallel, the 12-bit outcome of an
n-bit adder and the four control bits can be
converted to a 13-bit mask. The mask would transform
the output of a high-speed n-bit adder to
s or $s-2^n-1$, depending on the state of the 4
control bits. Based on a 25-ns 12-bit Schottky
look-ahead adder, a 20-ns 16x9 PLA, and 10-ns
FET mask switches, a 65-ns modulo p adder, for
$p = 2^n-1$, $2^n$, and $2^n+1$, can be realized. The
presence of a 65-ns modulo p adder will now allow
a 140-ns large moduli residue multiplier to be
based on 35-ns 4Kx1 HMOS memory units. For a
moduli set $\{2^{12}-1, 2^{12}, 2^{12}+1\}$, a fixed point
multiplier, having an output dynamic range of
$2^{36}-2^{12}$, can thus be fabricated having a word
rate of 7.143M multiplications per second, or
28.5M multiplications per second if a pipelined
architecture is used.

Summary:

The residue number system offers the potential
for high speed parallel arithmetic. This class
of arithmetic has been demonstrated to be useful
in designing recursive algorithms, transforms,
and digital filters. One of the principal
limitations to its use is its limited practical
dynamic range. To overcome this problem, a
large moduli multiplier, for the moduli set
$\{2^n-1, 2^n, 2^n+1\}$, was designed. This high-speed
large moduli system was the product of the new
$M^4$ algorithm and new technologies (RAM and PLA's).
The practical residue multiplier is capable of
supporting a pipelined execution rate of 28.5M
multiplications per second.

References 

1. G.A. Jullien, "Residue Number Scaling and
Other Operations Using ROM Arrays," IEEE
Trans. Computers, Vol. C-27, No. 4,
pp 325-337, April 1978.

2. C.H. Huang and F.J. Taylor, "Memory
Compression Scheme For Modular Arithmetic,"
to be published, IEEE ASSP.

3. M.A. Soderstrand, "A High Speed Low-Cost
Recursive Filter Using Residue Arithmetic,"
Proc. IEEE, Vol. 65, No. 7, July 1977.


Figure  2 


4. N.S. Szabo and R.I. Tanaka, Residue Arithmetic
and Its Applications To Computer
Technology, McGraw-Hill, 1967.

5. W.K. Jenkins and B.J. Leon, "The Use Of
Residue Number System in the Design of
Finite Impulse Response Filters," IEEE
CAS, CAS-24, April 1977.

6. W.K. Jenkins, "Techniques for Residue-to-Analog
Conversion for Residue-Encoded
Digital Filters," IEEE CAS, CAS-25,
July 1978.

7. A. Peled and B. Liu, "A New Hardware
Realization of Digital Filters," IEEE ASSP,
ASSP-22, December 1974.

8. W.K. Jenkins, "A Highly Efficient Residue-
Combinatorial Architecture for Digital
Filters," Proc. of the IEEE (Letter),
Vol. 66, No. 6, June 1978, pp 700-702.


This work was partially supported under AFOSR
Grant No. F49620-79-C-0066.




Figure  3 


ABSTRACT:

In this work we present a new table look-up
storage scheme and a class of table look-up
multipliers capable of working with exact
(modular) numbering systems.

Memory savings associated with the new look-up
multiplier, when compared to contemporary
methods, are shown to be on the order of 2/N,
where $N = 2^n$ and n = input wordlength. Throughput is
shown to be equal to that obtained using VLSI
and classic architectures.


Introduction: 

Digital  signal  processing  is  a  study  undergoing 
accelerated  growth,  acceptance,  and  application. 
With  the  possible  exception  of  number  theoretic 
transforms,  digital  signal  processing  has  been 
principally  advanced  through  technological 
achievements.  These  include  the  microprocessor, 
low  cost  high  performance  memory,  and  the  read 
only  memory  (ROM).  The  availability  of  the  ROM 
has  challenged  our  traditional  attitude  towards 
performing  digital  arithmetic.  In  particular, 
the art of fixed point multiplication has undergone
a partial metamorphosis through the use of
ROM based table look-ups. Since multiplication
has been a principal speed, cost, and complexity
limitation to digital filtering, advancements
in this area have been warmly received.

EXISTING LOOK-UP ARITHMETIC TECHNIQUES:

Much  of  the  reported  work  on  ROM  based  fixed 
point  multiplication  has  been  in  support  of 
linear  shift  invariant  digital  filtering.  Authors 
such  as  Jenkins  and  Leon,  Soderstrand,  etc.,  have 
studied  the  cost-speed  metrics  of  digital  filters 
using  the  residue  numbering  system.  The  principal 
advantage of the residue numbering system is that
it supports fixed point multiplication and addition
without need of preserving "carry information".
Thus, parallel operations are admissible. In
addition, modular multipliers were shown to be
realizable using table look-up methods and ROM's.

Jenkins recently questioned whether the performance
of the residue numbering system was due to
the intrinsic properties of the system or the use
of look-up multipliers. It was concluded that "it
appears that when no rounding (scaling) is
required, the residue structure always provides
better performance with regard to multiplication.
When all multiplications must be rounded, the 2's
complement structure provides better performance."
Since most non-trivial filter and transform
applications require a high plurality of multiply
and add operations (almost always ensuring the
overflow of the limited integer dynamic ranges
currently being implemented ($<2^{16}$ typ.)), the
future of residue based digital systems may appear
limited at first glance. We shall, in this work,
present some new results which overcome this
contemporary deficiency and in fact make the
potential of modular arithmetic systems even more
exciting.

Residue  ALU's 

The disadvantages of the residue number system
are manifold. Since the RNS possesses no most
significant digit, decimal to residue conversion,
division, magnitude comparison, and arithmetic
shift operations are cumbersome and should be
avoided. Register overflow, due to the finite
dynamic range, imposes a severe constraint on
RNS operations. Unlike weighted numbers (decimal,
binary, etc.), where rounding or truncating least
significant digits can control overflow, the RNS
has no least significant digits, so the more
general and inefficient operation known as scaling
must be used. Since scaling is a form of division,
its use should be discouraged. To gain insight
into this problem, consider the inner product of
two 31-dimensional real vectors x and y whose
entries are encoded as residue digits with respect
to $P = \{32, 31, 29, 27\}$. Without scaling, the dynamic
range of x and y would be limited to $V^{1/2}$, where
$V = (M/2)/31 = 25056$. Therefore, to ensure that no
worst case overflow can occur, a 7.3-bit (i.e.,
$V^{1/2} \approx 158 \approx 2^{7.3}$) dynamic range limitation must be imposed
on x and y. With scaling, larger input ranges can
be used at the expense of statistical accuracy in
the output space (analogous to roundoff errors).
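
The worst-case bound in this example can be checked numerically. The
following Python sketch is only an illustration; the function and variable
names are assumptions and do not come from the paper. It also shows that
25056 equals M/31 for this moduli set, while (M/2)/31 evaluates to 12528,
which would correspond to roughly 6.8 bits per input.

```python
import math

moduli = [32, 31, 29, 27]        # moduli set P from the text
M = math.prod(moduli)            # dynamic range M
terms = 31                       # length of the inner product

V_quoted = M // terms            # 25056, the figure quoted in the text
V_signed = (M // 2) // terms     # bound if the signed range [-M/2, M/2) is used

def input_bits(v):
    """Bits available to each input when |x| and |y| are at most sqrt(v)."""
    return math.log2(math.sqrt(v))

print(M, V_quoted, round(input_bits(V_quoted), 1))
print(V_signed, round(input_bits(V_signed), 1))
```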

Due to the dynamic range limitation of RNS systems, one is
generally forced to accept one of the following
two overflow prevention strategies.

1.  Increase  the  dynamic  range  to  a  sufficiently 
large  value  by  adding  more  moduli  to  P,  or 

2.  Make  scaling  a  more  efficient  operation. 

The first option represents a brute force attack
on the problem. Such an approach will increase the
cost and complexity metrics of a filter. In
addition, the moduli set P must be tailored to a
unique filter. The other approach appears to be
the most popular at this time. Szabo and
Tanaka, and others, have concentrated on
scaling efficiency through the choice of the
three-tuple moduli set $P = \{2^n-1, 2^n, 2^n+1\}$. This
moduli set has the ability to efficiently scale
a residue number by any one of the chosen moduli.
However, there is an intrinsic limitation
plaguing this method, and it is its dynamic
range. Using a large high-speed memory unit,
say 4Kx1, the input addressing space is limited
to $2^{12}$. This means that a modulus $p_i$ is
technically limited to $p_i \le 2^6$ (i.e., $x_i, y_i < 2^6$).
Therefore, the dynamic range of any modular
operation is given by $M = (2^n-1)(2^n)(2^n+1) \approx 2^{3n} = 2^{18}$.
In many applications, an 18-bit resolution is
insufficient.

New  Results 

Two new memory efficient algorithms have been
derived, based on a novel factorization
of a bilinear form. Over a real field it is
obvious that

$$xy = ((x+y)/2)^2 - ((x-y)/2)^2 \qquad (1)$$


and throughput (through the reduction or absence
of traditional scaling operations) can be
achieved. Finally, several versions of the $M^4$
algorithm can be considered. They are summarized
in figure 3.


Upon closer investigation of the table look-up
data base, a potential nuisance can be found. It
can be exemplified by observing that if $s^- = 9$,
p = 32, then $\phi(s^-) = \langle 9^2/4 \rangle_{32} = 20.25$. Therefore, it
may appear that two additional fractional
bits need to be added to the table's output
word length. However, this is not the case, as
shown by the following theorem:


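Theorem: Let $\|v\|$ denote the integer value of v.

Then $z = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$.

That is, only the integer value of $\phi$ need be used
and the fractional bits of $\phi(s^\pm)$ can be ignored.

Proof: Let $(x+y)/2 = v + k/2$ and $(x-y)/2 = q + k/2$, where
k = 0 or 1. Then

$$z = \langle \langle (x+y)^2/4 \rangle_p - \langle (x-y)^2/4 \rangle_p \rangle_p$$
$$= \langle \langle v^2 + kv + k^2/4 \rangle_p - \langle q^2 + kq + k^2/4 \rangle_p \rangle_p$$
$$= \langle \langle v^2 + kv \rangle_p + k^2/4 - \langle q^2 + kq \rangle_p - k^2/4 \rangle_p = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$$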
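
The theorem can be checked exhaustively for small moduli. The sketch below
uses exact rational arithmetic; the helper names `phi` and `check` are
assumptions made for illustration, and reducing only the integer part of a
rational square modulo p is one reading of the paper's notation.

```python
from fractions import Fraction
import math

def phi(s, p):
    """phi(s) = <s^2>_p for rational s: reduce the integer part mod p, keep the fraction."""
    sq = s * s
    whole = math.floor(sq)
    return (whole % p) + (sq - whole)

def check(p):
    """True if integer parts alone reproduce <xy>_p and the fractional parts always match."""
    for x in range(p):
        for y in range(p):
            a = phi(Fraction(x + y, 2), p)
            b = phi(Fraction(x - y, 2), p)
            z = (math.floor(a) - math.floor(b)) % p
            same_fraction = (a - math.floor(a)) == (b - math.floor(b))
            if z != (x * y) % p or not same_fraction:
                return False
    return True

print(phi(Fraction(9, 2), 32))            # 81/4, i.e. 20.25 as in the text
print(check(31), check(32), check(33))
```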


which, in modular form, becomes

$$\langle xy \rangle_p = \langle \phi(s^+) - \phi(s^-) \rangle_p \qquad (2)$$

where $\phi(s) = \langle s^2 \rangle_p$ with $s^+ = (x+y)/2$ and $s^- = (x-y)/2$.
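
A minimal table look-up sketch of this modular quarter-square product
follows. It assumes the table is addressed by the integers x+y and |x-y|
and stores only the integer part of the quarter square; the names
`build_table` and `quarter_square_mul` are hypothetical and are not taken
from the paper.

```python
def build_table(p):
    """Entry a holds the integer part of a^2/4, reduced modulo p, for a = 0 .. 2p-2."""
    return [(a * a // 4) % p for a in range(2 * p - 1)]

def quarter_square_mul(x, y, p, table):
    """Return <xy>_p using two look-ups and one modular subtraction."""
    s_plus = x + y           # address for phi(s+); equals 2*s+ in the text's notation
    s_minus = abs(x - y)     # address for phi(s-); the square does not depend on sign
    return (table[s_plus] - table[s_minus]) % p

p = 32
table = build_table(p)
print(quarter_square_mul(9, 7, p, table))   # 31, since 63 mod 32 = 31
```

Because x+y and x-y always have the same parity, the discarded fractional
parts are equal and cancel in the subtraction.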

At first glance this algorithm, which shall be
referred to as a minimal memory modular multiplier
($M^4$), would appear to be counterproductive
with respect to a memory conservation metric.

The memory requirements associated with the $M^4$
will be shown to be substantially less than
those of direct mechanizations. First, it should
be apparent that the integers $s^+$ and $s^-$ found
in equation 2 are bounded from above by $2^{n+1}$.
Therefore, only an (n+1)-bit table addressing
space is required to realize $\phi(s^\pm)$ versus the
2n-bit space needed for direct architectures.

It would appear, however, that there is an
exception to this rule, since one of the moduli
chosen is $p = 2^n+1$. Here the maximal value of
$s^+$ (or $s^-$) is $2^{n+1}$, which would technically
require an (n+2)-bit address. However, by using
the protocol found in figure 1, the table size
can be reduced to $2^{n+1}$ words for all moduli.

Here, the overflow bit serves to differentiate
$s^\pm = 0$ from $2^{n+1}$.
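
The table sizes being compared can be tabulated directly. This is a small
sketch under the assumption that a direct table is addressed by the pair
(x, y) and the quarter-square table by a single (n+1)-bit sum or
difference; `table_words` is a hypothetical helper name.

```python
def table_words(n):
    """Return (direct, quarter_square) table sizes in words for n-bit residues."""
    direct = 2 ** (2 * n)        # one entry per (x, y) pair
    quarter_square = 2 ** (n + 1)  # one entry per (n+1)-bit address
    return direct, quarter_square

for n in (6, 11, 12):
    direct, qs = table_words(n)
    # The ratio qs/direct equals 2/N with N = 2^n.
    print(n, direct, qs, qs / direct, 2 / 2 ** n)
```

For a 4K-word memory this gives n = 6 in the direct form and n = 11 in the
quarter-square form.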

The $M^4$ system architecture is abstracted in
Figure 2. This uses a $2^{n+1}$-word high-speed memory
for modular arithmetic look-up operations. Using,
for example, the previously referenced 4K, 30-ns
device, moduli having an 11-bit dynamic range
(vs. 6-bit in the direct form) can be mechanized.
This would yield a three-moduli dynamic range on
the order of $2^{33} \approx 8.6 \cdot 10^{9}$. That is, without
an increase in memory size (and therefore access
time), the dynamic range of the $M^4$ is $2^{33}/2^{18} = 2^{15}$
times larger than that obtainable through direct
means. This large increase in dynamic range
makes the RNS a viable alternative to traditional
filter design methods. Both improved precision


As a result, the parallel architecture is equivalent
to that shown in figure 4.


Modulo  p  Adder 

The $M^4$ multiplier requires that a modulo p adder be
used to combine the two component parts of the
solution (namely $\phi(s^+)$ and $\phi(s^-)$). Modulo p
adders pose an interesting design problem.

Unless a fast modulo p adder can be fabricated,
the overhead associated with addition will offset
any gain in throughput achieved through table
look-ups. For the moduli chosen, $2^n-1$, $2^n$, and
$2^n+1$, only the modulo $2^n$ adder can be realized
directly (n-bit adder with ignored overflow). It
would, however, be desirable to use an n-bit adder
to realize the modulo $2^n-1$ and $2^n+1$ adders as well.


Using n-bit AND gates to sense the zero condition
of $\langle s \rangle_{2^n}$, the overflow bit OVF, and the sign bits
of $\phi(s^+)$ and $\phi(s^-)$, a combinational logic
routine can be defined which will convert $\langle s \rangle_{2^n}$
into $\langle s \rangle_p$. It can be noted that the mapping
requirements are:

1. for $p = 2^n-1$, map s to s or $s-2^n+1 = \langle \langle s \rangle_{2^n} + 1 \rangle_{2^n}$,

2. for $p = 2^n$, map s to $s-2^n = \langle s \rangle_{2^n}$,

3. for $p = 2^n+1$, map s to s or $s-2^n-1 = \langle \langle s \rangle_{2^n} - 1 \rangle_{2^n}$.
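
The three mappings can be sketched in software as a correction applied to
the output of an n-bit adder. The code below is an illustrative model only,
not the PLA logic of the paper: it handles the sum of two residues, and the
function name `mod_add` and its `offset` parameter are assumptions.

```python
def mod_add(a, b, n, offset):
    """Add residues a and b modulo p = 2**n + offset, with offset in {-1, 0, +1}."""
    mask = (1 << n) - 1
    s = a + b                 # true sum of the two residues
    low = s & mask            # <s>_{2^n}: what an n-bit adder would output
    ovf = s >> n              # carry out of the n-bit adder
    if offset == 0:           # p = 2^n: ignore the overflow
        return low
    if offset == -1:          # p = 2^n - 1: add 1 when s >= p
        if ovf or low == mask:
            return (low + 1) & mask
        return low
    if s <= mask + 1:         # p = 2^n + 1 and s <= 2^n: already reduced
        return s
    return (low - 1) & mask   # p = 2^n + 1 and s > 2^n: s - 2^n - 1

print(mod_add(14, 14, 4, -1), mod_add(16, 16, 4, 1))
```

Note that for $p = 2^n+1$ the residue $2^n$ itself needs an extra bit, which
is one reason additional control bits appear in the hardware description.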

Suppose the modulus $p = 2^n+1$, n = 12, is to be implemented.
By using two commercially available
16x9 PLA's in parallel, the 12-bit outcome of an
n-bit adder and the four control bits can be
converted to a 13-bit mask. The mask would transform
the output of a high-speed n-bit adder to
s or $s-2^n-1$, depending on the state of the 4
control bits. Based on a 25-ns 12-bit Schottky
look-ahead adder, a 20-ns 16x9 PLA, and 10-ns
FET mask switches, a 65-ns modulo p adder, for
$p = 2^n-1$, $2^n$, and $2^n+1$, can be realized. The
presence of a 65-ns modulo p adder will now allow
a 140-ns large moduli residue multiplier to be
based on 35-ns 4Kx1 HMOS memory units. For a
moduli set $\{2^{12}-1, 2^{12}, 2^{12}+1\}$, a fixed point
multiplier, having an output dynamic range of
$2^{36}-2^{12}$, can thus be fabricated having a word
rate of 7.143M multiplications per second, or
28.5M multiplications per second if a pipelined
architecture is used.
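
The quoted rates follow from the cycle times. The short sketch below
assumes that the non-pipelined multiplier cycle is 140 ns and that the
pipelined rate is set by the 35-ns memory access; both periods and the
helper name `rate_mops` are stated here as assumptions.

```python
def rate_mops(period_ns):
    """Operations per second, in millions, for a cycle period given in nanoseconds."""
    return 1000.0 / period_ns

word_rate = rate_mops(140)       # non-pipelined multiplier
pipelined_rate = rate_mops(35)   # one result per memory access time

print(round(word_rate, 3), round(pipelined_rate, 2))
```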

Summary:

The residue number system offers the potential
for high speed parallel arithmetic. This class
of arithmetic has been demonstrated to be useful
in designing recursive algorithms, transforms,
and digital filters. One of the principal
limitations to its use is its limited practical
dynamic range. To overcome this problem, a
large moduli multiplier, for the moduli set
$\{2^n-1, 2^n, 2^n+1\}$, was designed. This high-speed
large moduli system was the product of the new
$M^4$ algorithm and new technologies (RAM and PLA's).
The practical residue multiplier is capable of
supporting a pipelined execution rate of 28.5M
multiplications per second.

References 

1. G.A. Jullien, "Residue Number Scaling and
Other Operations Using ROM Arrays," IEEE
Trans. Computers, Vol. C-27, No. 4,
pp 325-337, April 1978.

2. C.H. Huang and F.J. Taylor, "Memory
Compression Scheme For Modular Arithmetic,"
to be published, IEEE ASSP.

3. M.A. Soderstrand, "A High Speed Low-Cost
Recursive Filter Using Residue Arithmetic,"
Proc. IEEE, Vol. 65, No. 7, July 1977.

4. N.S. Szabo and R.I. Tanaka, Residue Arithmetic
and Its Applications To Computer
Technology, McGraw-Hill, 1967.

5. W.K. Jenkins and B.J. Leon, "The Use Of
Residue Number System in the Design of
Finite Impulse Response Filters," IEEE
CAS, CAS-24, April 1977.

6. W.K. Jenkins, "Techniques for Residue-to-Analog
Conversion for Residue-Encoded
Digital Filters," IEEE CAS, CAS-25,
July 1978.

7. A. Peled and B. Liu, "A New Hardware
Realization of Digital Filters," IEEE ASSP,
ASSP-22, December 1974.

8. W.K. Jenkins, "A Highly Efficient Residue-
Combinatorial Architecture for Digital
Filters," Proc. of the IEEE (Letter),
Vol. 66, No. 6, June 1978, pp 700-702.


This work was partially supported under AFOSR
Grant No. F49620-79-C-0066.


Figure 1


Figure 2


[Figure labels recovered from the scan: modulo p adder; mod p; sequential architecture]


Figure  3 




Figure  4 




LARGE MODULI MULTIPLIERS
FOR SIGNAL PROCESSING


by


F.J. Taylor

University of Cincinnati


Abstract 

The residue number system has recently been shown to be a
viable signal processing medium. However, it does possess limitations.
One of the most serious is overflow prevention through
magnitude scaling. One method of overcoming this defect is to
increase the dynamic range of the numbering system. To this end
a new high-speed large moduli multiplier has been developed. The
multiplier is the result of combining the quarter-squared
algorithm with recent breakthroughs in device technology. As
a result, equivalent 18-bit full precision products can be obtained
at a pipelined rate of 28.5M multiplies per second.


This work was partially supported under AFOSR grant F49620-79-C-0066.


I. INTRODUCTION

Recently, the residue number system (RNS) has received renewed attention
in the literature [1-3]. This mathematically mature study was, until the
present, in the background of digital system design because of the historic
inability of digital hardware to support RNS arithmetic [4]. However, recent
breakthroughs in the area of read-only memory technology have significantly
altered this case. Using high-speed bipolar ROM's, the ability of the RNS
to support ultra high-speed digital filtering has been experimentally
demonstrated [5]. The question of decimal-to-residue I/O operations has also
been addressed [6]. However, a major obstacle to the cause of RNS filtering
has been register overflow protection. In order to guarantee that system
registers do not overflow during run-time, an inefficient operation referred
to as scaling has to be performed. If scaling were not required, RNS filters
were shown to possess higher throughput rates than those obtainable using
distributed arithmetic (i.e., bit-slice; ref [7]) [8]. However, when scaling
is required, the RNS architecture was shown to be at a disadvantage. It
should be remembered, however, that the distributed arithmetic filter is a
constant coefficient device (i.e., shift-invariant) whereas the linear RNS
filter is general (i.e., variable coefficient). Therefore the RNS provides
the user with the versatility needed to perform adaptive, optimal (e.g.,
minimal variance in a non-stationary stochastic environment), frequency
tuneable filtering which cannot be supported in a bit-slice configuration.

In this work, a new multiplier architecture is developed which
significantly enhances the case for RNS filters by significantly reducing
scaling overhead. The high-speed residue multiplier will be shown to
increase the dynamic range of the RNS to a value which either reduces the
number of scaling operations to a small fraction of their original number
or makes scaling unnecessary. All this is accomplished without increasing
the memory budget over and above that found in contemporary RNS designs.

II. RNS OVERVIEW

Interest in the RNS is due to its ability to perform high-speed arithmetic.
Speed is achieved through the use of a high degree of parallelism
and an absence of carry information requirements. These two attributes are
a byproduct of the fact that there does not exist a most (least) significant
residue digit. That is, all residue digits are of equal importance. More
specifically, if P is a moduli set such that $P = \{p_1, \ldots, p_L\}$, and the $p_i$'s
are relatively prime, then if $x \in [-M/2, M/2)$, x is uniquely represented by
the L-tuple

$$x \leftrightarrow (x_1, \ldots, x_L) \qquad (1)$$

with

$$x_i = \begin{cases} \langle x \rangle_{p_i} & \text{if } x \ge 0 \\ p_i - \langle |x| \rangle_{p_i} & \text{otherwise} \end{cases} \qquad (2)$$

where $\langle x \rangle_{p_i}$ denotes x modulo $p_i$ and $M = \prod_{i=1}^{L} p_i$. The bilinear composition of
two integers, say $x \leftrightarrow (x_1, \ldots, x_L)$ and $y \leftrightarrow (y_1, \ldots, y_L)$, written $x \circ y$ (where
$\circ$ denotes addition, subtraction, or multiplication), is given by

$$x \circ y \leftrightarrow (x_1 \circ y_1, \ldots, x_L \circ y_L) \qquad (3)$$
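
The encoding and the digit-wise composition can be illustrated in a few
lines of Python. This is a sketch under stated assumptions: the function
names are hypothetical, decoding uses the standard Chinese remainder
theorem (which the text above does not spell out), and each digit result is
reduced modulo its own modulus.

```python
import math
import operator

def to_rns(x, moduli):
    """Encode a signed integer as residue digits; Python's % is non-negative for p > 0."""
    return tuple(x % p for p in moduli)

def rns_op(a, b, moduli, op):
    """Apply op digit by digit; no information passes between digits."""
    return tuple(op(ai, bi) % p for ai, bi, p in zip(a, b, moduli))

def from_rns(digits, moduli):
    """Decode with the Chinese remainder theorem into [-M/2, M/2)."""
    M = math.prod(moduli)
    x = 0
    for d, p in zip(digits, moduli):
        Mi = M // p
        x += d * Mi * pow(Mi, -1, p)   # modular inverse; requires Python 3.8+
    x %= M
    return x - M if 2 * x >= M else x

moduli = [32, 31, 29, 27]
x, y = -123, 457
product = rns_op(to_rns(x, moduli), to_rns(y, moduli), moduli, operator.mul)
print(from_rns(product, moduli), x * y)
```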

It can be seen that each residue digit, namely $x_i \circ y_i$, can be computed
independently of all others (i.e., no carry information requirements). In practice,
the mapping of $x_i$ and $y_i$ into $x_i \circ y_i$ is accomplished using table lookups where
the table resides in randomly accessed read-only memory. Typical high-speed
memory modules, which are currently available, are:


Device    Type  Technology  Configuration  Access Speed
10149     ROM   ECL         256x4          20 ns
SN54S     ROM   TTL         1024x4         35 ns
2147H-1   RAM   HMOS        4096x1         30 ns
2125H-1   RAM   HMOS        1024x1         20 ns
12167     RAM   HMOS        16384x1        45 ns

The product of two residues modulo $p_i$, $p_i \le 2^n$, can be precomputed and
stored in a $2^m \times n$-bit memory unit where m = 2n. Using a large existing
high-speed memory (4Kx1 at 30 ns), residues having up to six bit integer values
can be used (e.g., P = {64, 63, ...}). Thus, fixed-point multipliers having a
dynamic range of $[-M/2, M/2)$ can be architected which have execution rates
in the low nanoseconds.

The disadvantages of the residue number system are manifold. Since
the RNS possesses no most significant digit, decimal to residue conversion,
division, magnitude comparison, and arithmetic shift operations are
cumbersome and should be avoided. Register overflow, due to the finite dynamic
range, imposes a severe constraint on RNS operations. Unlike weighted
numbers (decimal, binary, etc.), where rounding or truncating least significant
digits can control overflow, the RNS has no least significant digits, so the
more general and inefficient operation known as scaling must be used. Since
scaling is a form of division, its use should be discouraged. To gain insight
into this problem, consider the inner product of two 31-dimensional real
vectors x and y whose entries are encoded as residue digits with respect to
$P = \{32, 31, 29, 27\}$. Without scaling, the worst-case value of x and y would
be limited to $V^{1/2}$, where $V = (M/2)/31 = 25056$. Therefore, to ensure that no
worst case overflow can occur, a 7.3-bit (i.e., $V^{1/2} \approx 158 \approx 2^{7.3}$) dynamic
range limitation must be imposed on x and y. With scaling, larger input ranges
can be used at the expense of statistical accuracy in the output space
(analogous to roundoff errors).

Due to the dynamic range limitation of RNS systems, one is generally
forced to accept one of the following two overflow prevention strategies.


1. Increase the dynamic range to a sufficiently large value by adding
more moduli to P, or


2. Make scaling a more efficient operation.

The first option represents a brute force attack on the problem. Such
an approach will increase the cost and complexity metrics of a filter. In
addition, the moduli set P must be tailored to a unique filter. The other
approach appears to be the most popular at this time. Szabo and Tanaka,
and others, have concentrated on scaling efficiency through the choice
of the three-tuple moduli set $P = \{2^n-1, 2^n, 2^n+1\}$. This moduli set has
the ability to efficiently scale a residue number by any one of the chosen
moduli. However, there is an intrinsic limitation plaguing this method, and
it is its dynamic range. Using a large high-speed memory unit, say 4Kx1,
the input addressing space is limited to $2^{12}$. This means that a modulus $p_i$
is technically limited to $p_i \le 2^6$ (i.e., $x_i, y_i < 2^6$). Therefore, the dynamic
range of any modular operation is given by $M = (2^n-1)(2^n)(2^n+1) \approx 2^{3n} = 2^{18}$.
In many applications, an 18-bit resolution is insufficient.


III. Principal Result

It is desirable to keep the previously discussed three moduli structure
for purposes of potential scaling needs. However, in order to overcome the
existing disadvantage of this system, that of dynamic range, a new approach
is called for. Since it is unrealistic to assume substantially larger
density high-speed memories will continue to become available, it is
incumbent that a more memory efficient residue arithmetic unit be designed.
An efficient algorithm, which is ideally suited for this application, is
known as the quarter-square multiplier [9-11].


Over a real field it is obvious that

$$xy = ((x+y)/2)^2 - ((x-y)/2)^2 \qquad (4)$$

which, in modular form, becomes

$$\langle xy \rangle_p = \langle \phi(s^+) - \phi(s^-) \rangle_p \qquad (5)$$

where $\phi(s) = \langle s^2 \rangle_p$ with $s^+ = (x+y)/2$ and $s^- = (x-y)/2$.

The quarter-squared multiplier has been studied by J.M. Pollard (1976) in a
Galois field. Questions of hardware implementation were not considered and,
due to the Galois field limitation, only prime moduli could be considered.
H. Nussbaumer (1976) studied the quarter-square multiplier over real fields
for use in ROM intensive digital filters. Soderstrand and Fields (1977) made
brief reference to this multiplier for residue arithmetic but offered no
satisfactory hardware realization. In this paper, a practical residue arithmetic
quarter-squared multiplier will be architected using commercially available
hardware.

A problem that would seem to plague the quarter-square multiplier is the
need to realize the division by two of the sums and differences found in Eq. 4.
In general, the existence of an $N^{-1}$, such that $\langle N^{-1} N \rangle_p = 1$, can only be guaranteed
if N is relatively prime to p. Since one of the chosen moduli is $p = 2^n$, a
multiplicative inverse of 2 cannot be guaranteed to exist. Therefore, equation 4
cannot be interpreted as the equation $\langle \langle 4^{-1} \rangle_p \langle (x+y)^2 - (x-y)^2 \rangle_p \rangle_p$. The
potential problem of dividing the sums and differences, found in equation 4, by
2, will be explicitly and efficiently treated for the first time later in this
paper. For a $2^m$ word memory unit, the direct product architecture (i.e., xy)
would limit the maximal moduli to be bounded by $2^n$, n = m/2. In fact, this
claim can be extended to the case when $p = 2^n+1$ through use of the following
modification. Observe that if $x_i = 0$, then it automatically follows that
$\langle x_i y_i \rangle_p = 0$. Therefore, if $x_i = 0 \rightarrow 0\;00 \ldots 0$ (which is a detectable condition in
that the (n+1)st bit and the remaining n-bit block are zero), the output
register would be automatically cleared. Therefore, the lookup table need
not be accessed for this case. Instead, the all zero n-bit portion of the
table address allocated to $x_i$ can be used to represent $x_i = 2^n$, where
$2^n \rightarrow 1\;00 \ldots 0$ (see Figure 1). Here, the table would be programmed to map
$y_i$ into $\langle 2^n y_i \rangle_{p_i}$ using only a $2^m$ word memory.
pi 

The memory requirements associated with the quarter-square multiplier are
substantially less than those of direct mechanizations. First, it should be
apparent that the integers $s^+$ and $s^-$, found in equation 5, are bounded from
above by $2^{n+1}$. Therefore, only an (n+1)-bit table addressing space is required
to realize $\phi(s^\pm)$ versus the 2n-bit space needed for direct architectures. It
would appear, however, that there is an exception to this rule, since one
of the moduli chosen is $p = 2^n+1$. Here the maximal value of $s^+$ (or $s^-$) is $2^{n+1}$,
which would technically require an (n+2)-bit address. However, by using the
protocol found in Figure 2, which is an adaptation of the network found in
Figure 1, the table size can be reduced to $2^{n+1}$ words for all moduli. Here,
the overflow bit serves to differentiate $s^\pm = 0$ from $2^{n+1}$.

The quarter-squared architecture is abstracted in Figure 3. It uses a
$2^{n+1}$ word high-speed memory for modular arithmetic lookup operations. Using,
for example, the previously referenced 4K, 30-ns device, moduli having an 11-bit
dynamic range (vs. 6-bit in the direct form) can be mechanized. This would
yield a three-moduli dynamic range on the order of $2^{33} \approx 8.6 \cdot 10^{9}$. That
is, without an increase in memory size (and therefore access time), the
dynamic range of the quarter-squared multiplier is $2^{33}/2^{18} = 2^{15}$ times larger
than that obtainable through direct means. This large increase in dynamic
range makes the RNS a viable alternative to traditional filter design methods.
Both improved precision and throughput (through the reduction or absence of
traditional scaling operations) can be achieved.

Several  versions  of  the  multiplier  algorithm  can  be  considered.  They  are 
summarized  in  Figure  4.  The  first,  called  the  sequential  form,  would  have 
an  estimated  throughput  rate  of  240  ns  based  on  a  60  ns  lookahead  adder  and 
memory  having  an  access  time  of  30  ns  with  a  cycle  time  of  60  ns.  The  second 
architecture, called the parallel form, would run at a 100 ns rate. The
parallel architecture is preferred because of its higher speed and simpler control.

A 60 ns pipelined execution rate can be purchased at a small hardware cost.

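Example: $p = 2^{11} = 2048$, $x = 1040$, $y = 352$; then
$z = \langle xy \rangle_p = 1536$

A1: $s^+ = 1392$; $\phi(s^+) = \langle 484416 \rangle_p = 1088$;

A1: $s^- = 688$; $\phi(s^-) = \langle 118336 \rangle_p = 1600$;

A2: $z = \langle \phi(s^+) - \phi(s^-) \rangle_p = \langle -512 \rangle_p = 1536$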
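The example can be reproduced with a short Python sketch of the quarter-square lookup. The function names below are illustrative and do not come from the report. Two assumptions are made: the table stores only the integer part of $s^2/4$ reduced modulo $p$, and the difference term is indexed by $|x - y|$ so that the table address is never negative.

```python
def build_quarter_square_table(p: int) -> list[int]:
    """Table of floor(s^2 / 4) mod p for every possible sum s = x + y.

    With residues 0 <= x, y < p the largest address is 2 * (p - 1).
    """
    return [(s * s // 4) % p for s in range(2 * (p - 1) + 1)]


def quarter_square_multiply(x: int, y: int, p: int, table: list[int]) -> int:
    """Return <x * y>_p using two table lookups and one modular subtraction."""
    s_plus = x + y          # address of the first lookup
    s_minus = abs(x - y)    # assumption: use the magnitude of the difference
    return (table[s_plus] - table[s_minus]) % p


P = 2048                    # p = 2^11, as in the example
TABLE = build_quarter_square_table(P)

z = quarter_square_multiply(1040, 352, P, TABLE)
```

For $p = 2^{11}$ the table built here has 4095 entries, which fits the 4K memory discussed in the text.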

Upon closer investigation of the table lookup data base, a potential
nuisance can be found. It can be exemplified by observing that if $s^- = 9$ and
$p = 32$, then $\phi(s^-) = \langle 9^2/4 \rangle_{32} = 20.25$. It may therefore appear
that two additional fractional bits need to be added to the table's output
wordlength. However, this is not the case, as shown by the following
theorem:

Theorem: Let $[[v]]$ denote the integer value of $v$. Then
$z = \langle [[\phi(s^+)]] - [[\phi(s^-)]] \rangle_p$.
That is, only the integer value of $\phi$ need be used, and the fractional bits
of $\phi(s^\pm)$ can be ignored.




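Proof: For integers $x$ and $y$, one may define two rational numbers, namely
$(x+y)/2 = v + k/2$ and $(x-y)/2 = q + k/2$, where $v$ and $q$ are integers and
$k = 0$ or $1$. Then

$z = \langle \langle (x+y)^2/4 \rangle_p - \langle (x-y)^2/4 \rangle_p \rangle_p$
$  = \langle \langle v^2 + kv + k^2/4 \rangle_p - \langle q^2 + kq + k^2/4 \rangle_p \rangle_p$
$  = \langle \langle v^2 + kv \rangle_p + k^2/4 - \langle q^2 + kq \rangle_p - k^2/4 \rangle_p$
$  = \langle [[\phi(s^+)]] - [[\phi(s^-)]] \rangle_p$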
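The theorem can also be checked numerically. The following Python sketch (function names are illustrative, not from the report) compares the truncated quarter-square difference with the true product for every pair of residues, and reproduces the fractional example $9^2/4 = 20.25$ whose integer part is 20.

```python
from fractions import Fraction


def phi_int(s: int, p: int) -> int:
    """Integer part of s^2 / 4, reduced modulo p (the truncated table entry)."""
    return (s * s // 4) % p


def theorem_holds(p: int) -> bool:
    """True if <[[phi(x+y)]] - [[phi(x-y)]]>_p equals <x*y>_p for all residues."""
    return all(
        (phi_int(x + y, p) - phi_int(x - y, p)) % p == (x * y) % p
        for x in range(p)
        for y in range(p)
    )


# The nuisance case from the text: s = 9 gives a fractional table entry.
exact_entry = Fraction(9 * 9, 4)      # 81/4 = 20.25
truncated_entry = phi_int(9, 32)      # integer part only

# Check moduli of the form 2^n - 1, 2^n, 2^n + 1 for n = 5.
results = {p: theorem_holds(p) for p in (31, 32, 33)}
```

The check passes for even moduli such as 32, where 4 has no multiplicative inverse, which is the point made about Galois fields in the text.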

As a result, the parallel architecture is equivalent to that shown in
Figure 5. Furthermore, by deriving the above theorem over a rational field,
and showing that the results pertain to the integers, several
classical problems are overcome:

1. The quarter-squared multiplier is not restricted to the Galois fields
suggested by Pollard.

2. The question of the existence of the multiplicative inverse of 4 is
now moot.


The quarter-square multiplier requires that a modulo $p$ adder be used to
combine the two component parts of the solution (namely $\phi(s^+)$ and
$\phi(s^-)$). Modulo $p$ adders pose an interesting design problem. Unless a fast
modulo $p$ adder can be fabricated, the overhead associated with addition will
offset any gain in throughput achieved through table lookups. For the moduli
chosen, $2^n - 1$, $2^n$, and $2^n + 1$, only the modulo $2^n$ adder can be realized
directly (an n-bit adder with ignored overflow). It would, however, be
desirable to use an n-bit adder to realize the modulo $2^n - 1$ and $2^n + 1$
adders as well. For the purpose of clarity, let $s$ be defined to be the sum of
$\phi(s^+)$ and $\phi(s^-)$. The following observation then follows:


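TABLE I

Columns: (a) case; (b) integer range of $s$; (c) output of the modulo $2^N$
adder, $\langle s \rangle_{2^N}$; (d) overflow bit (OVF); (e) modulus $p$; (f) output of
the modulo $p$ adder, $\langle s \rangle_p$; (g), (h) example values of $s$ and
$\langle s \rangle_p$ for $N = 3$.

(a)  (b)                                 (c)          (d)          (e)          (f)                (g)  (h)

1    $s = 0$                             $0$          0            $2^N - 1$    $0$                0    0
2    $1 \le s \le 2^N - 2$               $s$          0            $2^N - 1$    $s$                4    4
3    $s = 2^N - 1$                       $s$          0            $2^N - 1$    $0$                7    0
4    $s = 2^N$                           $0$          1            $2^N - 1$    $s - 2^N + 1$      8    1
5    $2^N + 1 \le s \le 2^{N+1} - 4$     $s - 2^N$    1            $2^N - 1$    $s - 2^N + 1$      10   3

6    $s = 0$                             $0$          0            $2^N$        $0$                0    0
7    $1 \le s \le 2^N - 1$               $s$          0            $2^N$        $s$                4    4
8    $s = 2^N$                           $0$          1            $2^N$        $0$                8    0
9    $2^N + 1 \le s \le 2^{N+1} - 2$     $s - 2^N$    1            $2^N$        $s - 2^N$          10   2

10   $s = 0$                             $0$          0            $2^N + 1$    $0$                0    0
11   $1 \le s \le 2^N - 1$               $s$          0            $2^N + 1$    $s$                4    4
12   $s = 2^N$                           $0$          1            $2^N + 1$    $s$                8    8
13   $2^N + 1 \le s \le 2^{N+1} - 1$     $s - 2^N$    1            $2^N + 1$    $s - 2^N - 1$      10   1
14   $s = 2^{N+1}$ (special case)        $0$          (illegible)  $2^N + 1$    $2^N - 1$          16   7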

Using n-bit AND gates to sense the zero condition of $\langle s \rangle_{2^N}$, the
overflow bit OVF, and the sign bits of $\phi(s^+)$ and $\phi(s^-)$, combinational
logic can be defined which will map $\langle s \rangle_{2^N}$ into $\langle s \rangle_p$. It can
be noted from the data found in Table I that the mapping requirements are:

1. for $p = 2^n - 1$, map $s$ to $s$ or $s - 2^n + 1 = \langle \langle s \rangle_{2^n} + 1 \rangle_{2^n}$

2. for $p = 2^n$, map $s$ to $s - 2^n = \langle s \rangle_{2^n}$

3. for $p = 2^n + 1$, map $s$ to $s$ or $s - 2^n - 1 = \langle \langle s \rangle_{2^n} - 1 \rangle_{2^n}$
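The three mappings can be sketched in Python as a correction applied to the output of a plain n-bit adder. This is an illustrative software model, not the PLA circuit of the report: the function name, the `offset` parameter (-1, 0, or +1 for $p = 2^n - 1$, $2^n$, $2^n + 1$), and the use of a two-bit carry count in place of the overflow and sign control bits are all assumptions made for the sketch.

```python
def mod_add(a: int, b: int, n: int, offset: int) -> int:
    """Add residues a and b modulo p = 2^n + offset, offset in {-1, 0, +1}.

    Only the n low-order sum bits and the carry out of the n-bit adder are
    used, following the case analysis of Table I.
    """
    mask = (1 << n) - 1
    s = a + b
    low = s & mask      # output of the n-bit adder, <s>_{2^n}
    carry = s >> n      # 0, 1, or (only when a = b = 2^n) 2

    if offset == 0:                       # mapping 2: ignore the overflow
        return low
    if offset == -1:                      # mapping 1: s or s - 2^n + 1
        if carry == 0:
            return 0 if low == mask else low
        return low + 1
    if offset == 1:                       # mapping 3: s or s - 2^n - 1
        if carry == 0:
            return low
        if carry == 1:
            return (1 << n) if low == 0 else low - 1
        return mask                       # special case s = 2^(n+1)
    raise ValueError("offset must be -1, 0, or +1")


# N = 3 example with s = 10 (here formed as 5 + 5) for p = 7, 8, 9.
examples = [mod_add(5, 5, 3, off) for off in (-1, 0, 1)]
```

For $s = 10$ and $N = 3$ the sketch returns 3, 2, and 1 for $p = 7$, 8, and 9, in agreement with the example values in Table I.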

Mapping two is trivially satisfied with an n-bit adder. The other two
mappings require that $s$ remains unchanged or that it is decremented or
incremented by unity. There are several ways to approach this problem. Bioul,
Davio, and Quisquater have presented an unorthodox architecture for a modulo
$(2^n - 1)$ adder using two-input gates. Modulo $(2^n \pm 1)$ adders can also be
realized through the use of end-around carries. However, compared to modulo
$2^n$ addition, this approach would almost double the addition delay. This
extended delay problem can be overcome through added complexity (i.e., time
multiplexing two end-around-carry adders). Mappings one and three can be
efficiently realized in the manner suggested by the example found in
Appendix A. The functional operation of adding one (mapping 1) or subtracting
one (mapping 3) from the output of an n-bit adder is performed by a PLA. The
PLA will provide an overlay mask which accomplishes the required task. The
derivation and utility of the mask can be understood in the context of the
following example.

Example: Suppose $s$ is an 11-bit word having a decimal value of $s_{10} = 92$, or
$s_2 = 00001011100$. If $s_{10} - 1 = 91$, or $(s_{10} - 1)_2 = 00001011011$, is
desired, one notes that only the 3 LSBs of $s_2$ need be altered. In general,
for $n = 12$, only the following 13 distinct binary masks are required to form
$s - 1$:


MSB       pattern       LSB

X X X X X X X X X X X X
X X X X X X X X X X X 0
X X X X X X X X X X 0 1
         . . .
X 0 1 1 1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1


Notation:

X        = leave corresponding bit location of $s_2$ unchanged
1 (or 0) = change corresponding bit location of $s_2$ to 1 (or 0)

Table II. MASK
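The decrement masks of Table II can be generated mechanically: subtracting one clears the lowest set bit and sets every bit below it, so the mask depends only on the number of trailing zeros. The Python sketch below illustrates this; the function names and the string representation of a mask are assumptions for illustration, not the PLA encoding used in the report.

```python
def decrement_mask(s: int, n: int) -> str:
    """MSB-first mask over {'X', '0', '1'} that turns the n-bit word s into s - 1.

    Requires 0 < s < 2^n. 'X' leaves a bit unchanged; '0'/'1' force its value.
    """
    trailing_zeros = (s & -s).bit_length() - 1
    return "X" * (n - trailing_zeros - 1) + "0" + "1" * trailing_zeros


def apply_mask(s: int, mask: str) -> int:
    """Overlay the mask on the binary representation of s."""
    bits = format(s, f"0{len(mask)}b")
    return int("".join(b if m == "X" else m for b, m in zip(bits, mask)), 2)


# The 11-bit example from the text: 92 -> 91 alters only the 3 LSBs.
mask_92 = decrement_mask(92, 11)
result_92 = apply_mask(92, mask_92)

# For n = 12 there are 12 decrement masks, plus the all-X "unchanged" mask.
n = 12
distinct_masks = {decrement_mask(s, n) for s in range(1, 1 << n)}
total_masks = len(distinct_masks) + 1
```

Counting the all-X mask used when $s$ is left unchanged, the sketch yields the 13 masks quoted in the text for $n = 12$.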


Suppose the modulus $p = 2^n + 1$, $n = 12$, is to be implemented. By using
two commercially available 16x9 PLA's in parallel, the 12-bit output of an
n-bit adder (shown as $\langle s \rangle_{2^N}$ in Table I) and the four previously
specified control bits can be converted to a 13-bit mask. The mask would
transform the output of a high-speed n-bit adder to $s$ or $s - 2^n - 1$,
depending on the state of the 4 control bits. Based on a 25-ns 12-bit Schottky
lookahead adder, a 20-ns 16x9 PLA, and 10-ns TTL mask switches (see the
notation comments of Table II), a 65-ns modulo $p$ adder, for $p = 2^n - 1$,
$2^n$, and $2^n + 1$, can be realized. The presence of a 65-ns modulo $p$ adder
will now allow a 140-ns large moduli residue multiplier based on 35-ns 4Kx1
HMOS memory units. (See Figure 5)

For a moduli set $\{2^{12} - 1, 2^{12}, 2^{12} + 1\}$, a fixed point multiplier,
having an output dynamic range of $2^{36} - 2^{12}$, can thus be fabricated having
a word rate of 7.143 M multiplications per second. This compares favorably
with new 16x16 VLSI multipliers. Using a pipelined architecture, which requires
the insertion of the storage registers found in Figure 5, a very impressive
throughput figure of 28.57 M multiplications per second can be achieved. It is
important, and fortunate, to realize that the Intel HMOS memory unit used in
this analysis has a cycle time equal to the access time. If, as is often found
in practice, a memory unit has a cycle time approximately twice the access
time, then the pipeline delay would increase from 35 ns to 70 ns.
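The throughput and dynamic range figures quoted above follow from simple arithmetic, reproduced in the Python sketch below. The helper name is illustrative; the 140-ns, 35-ns, and 70-ns periods are taken from the text.

```python
def rate_millions_per_second(period_ns: float) -> float:
    """Operations per second, in millions, for a given period in nanoseconds."""
    return 1e3 / period_ns


word_rate = rate_millions_per_second(140)        # non-pipelined multiplier
pipelined_rate = rate_millions_per_second(35)    # cycle time equal to access time
slow_cycle_rate = rate_millions_per_second(70)   # cycle time twice the access time

# Dynamic range of the moduli set {2^12 - 1, 2^12, 2^12 + 1}.
n = 12
dynamic_range = (2**n - 1) * 2**n * (2**n + 1)
```

The product of the three moduli is $2^{12}(2^{24} - 1) = 2^{36} - 2^{12}$, and doubling the memory cycle time halves the pipelined rate.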


Summary:

The residue number system offers the potential for high speed parallel
arithmetic. This class of arithmetic has been demonstrated to be useful in
designing recursive algorithms, transforms, and digital filters. One of
the principal limitations to its use is its limited practical dynamic range.

To overcome this problem, a large moduli multiplier, for the moduli set
$\{2^n - 1, 2^n, 2^n + 1\}$, was designed. This high-speed large moduli system was
the product of the novel algorithm and new technologies (RAM and PLA's).

The practical residue multiplier is capable of supporting a pipelined
execution rate of 28.5 M multiplications per second.

Lastly, the performance of the residue multiplier is noted to be technology
dependent. As memory densities increase and speeds improve, the multiplier
performance will directly benefit. As a result, the higher speeds associated
with the next generation of submicron technology devices can provide a speed-up
of two to five. In the more distant future, when and if the Josephson
technology becomes a viable design tool, residue multiplication rates, using the
proposed methodology, may approach 500M multiplications per second.

APPENDIX A:


An example of a PLA controlled $2^n + 1$ adder, for $n = 3$, is diagrammed in
Figure A.1. In this figure, the sum of A and B modulo $(2^n + 1)$ is formed
(i.e., (5 + 6) modulo 9 = 2). Also, the addition delay for $n = 12$, based on
commercially available hardware, is computed to be 65 ns. The general
architecture of the adder is diagrammed in Figure A.2.


FIGURE CAPTIONS:

Figure 1: Modulo $2^n + 1$ ALU

Figure 2: Memory Compression for $s^\pm$

Figure 3: Modulo Multiplier

Figure 4: Architectures

Figure 5: Large Moduli Multiplier

Figure A.1: Example Problem

Figure A.2: General Architecture